CN108564940B - Speech recognition method, server and computer-readable storage medium

Speech recognition method, server and computer-readable storage medium

Info

Publication number
CN108564940B
CN108564940B (application CN201810227474.8A)
Authority
CN
China
Prior art keywords
model
neural network
speech
phoneme
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810227474.8A
Other languages
Chinese (zh)
Other versions
CN108564940A (en)
Inventor
梁浩
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810227474.8A
Priority to PCT/CN2018/102204 (published as WO2019179034A1)
Publication of CN108564940A
Application granted
Publication of CN108564940B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
    • G10L 25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of analysis window
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The invention discloses a speech recognition method, which comprises the following steps: constructing an acoustic model; when an original speech signal is acquired, preprocessing the speech signal to extract a valid speech portion; extracting acoustic features from the valid speech portion; inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through a trained phoneme training model, and outputting the recognition result to a trained hybrid neural network model based on memory unit connection; and outputting, through the trained hybrid neural network model based on memory unit connection, the text information corresponding to the speech information according to the received recognition result. The invention also provides a server and a computer-readable storage medium. The speech recognition method, server and computer-readable storage medium provided by the invention can improve the accuracy of speech recognition.

Description

Speech recognition method, server and computer-readable storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, a server, and a computer-readable storage medium.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to enable a machine to convert speech signals into text through recognition and understanding, and is an important branch of modern artificial intelligence. Speech recognition is a prerequisite for natural language processing and can effectively promote the development of voice-controlled interaction. It greatly facilitates daily life, for example in smart homes and voice input, and allows command operation by users whose hands and eyes are otherwise occupied or less able, such as middle-aged and elderly users, or in scenarios such as driving and travelling. The accuracy of speech recognition directly determines the effectiveness of these applications, but the accuracy of current speech recognition does not yet meet users' requirements.
Disclosure of Invention
In view of the above, the present invention provides a speech recognition method, a server and a computer readable storage medium, which can improve the accuracy of speech recognition.
First, to achieve the above object, the present invention provides a speech recognition method, including:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a hybrid neural network model based on memory unit connection;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the trained phoneme training model, and outputting the recognition result to the trained hybrid neural network model based on memory unit connection;
and outputting, through the trained hybrid neural network model based on memory unit connection, the text information corresponding to the speech information according to the received recognition result.
Optionally, the step of preprocessing the speech signal to extract an effective speech part when the speech signal is acquired specifically includes:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise from the short-time stationary signal, and extracting the effective speech part, wherein the effective speech part is the short-time stationary signal within a preset frequency range.
Optionally, the step of extracting acoustic features from the valid speech part specifically includes:
fourier transforming the effective speech portion to convert the speech portion in the time domain to an energy spectrum in the frequency domain;
according to the energy spectrum, highlighting formant features of the voice part through a set of Mel-scale triangular filter banks;
and obtaining acoustic characteristics by performing discrete cosine transform on the energy spectrum output by the triangular filter bank.
Optionally, the phoneme training model includes a monophone model and a triphone model, and the step of inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the phoneme training model, and outputting the recognition result to the hybrid neural network model based on memory unit connection specifically includes:
comparing the similarity of different phoneme pronunciations according to the acoustic features through the monophone model, and outputting an alignment result to the triphone model;
taking into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and outputting a forced phoneme alignment result;
and outputting the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Optionally, the acoustic feature is an MFCC (Mel-frequency cepstral coefficient) feature.
In addition, to achieve the above object, the present invention further provides a server, which includes a memory and a processor, wherein the memory stores a speech recognition system operable on the processor, and the speech recognition system implements the following steps when executed by the processor:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a hybrid neural network model based on memory unit connection;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the trained phoneme training model, and outputting the recognition result to the trained hybrid neural network model based on memory unit connection;
and outputting, through the trained hybrid neural network model based on memory unit connection, the text information corresponding to the speech information according to the received recognition result.
Optionally, the step of preprocessing the speech signal to extract an effective speech part when the speech signal is acquired specifically includes:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise from the short-time stationary signal, and extracting the effective speech part, wherein the effective speech part is the short-time stationary signal within a preset frequency range.
Optionally, the step of extracting acoustic features from the valid speech part specifically includes:
fourier transforming the effective speech portion to convert the speech portion in the time domain to an energy spectrum in the frequency domain;
according to the energy spectrum, highlighting formant features of the voice part through a set of Mel-scale triangular filter banks;
and obtaining an acoustic feature by performing a discrete cosine transform on the energy spectrum output by the triangular filter bank, wherein the acoustic feature is an MFCC (Mel-frequency cepstral coefficient) feature.
Optionally, the phoneme training model includes a monophone model and a triphone model, and the step of inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the phoneme training model, and outputting the recognition result to the hybrid neural network model based on memory unit connection specifically includes:
comparing the similarity of different phoneme pronunciations according to the acoustic features through the monophone model, and outputting an alignment result to the triphone model;
taking into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and outputting a forced phoneme alignment result;
and outputting the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a speech recognition system, which can be executed by at least one processor to cause the at least one processor to execute the steps of the speech recognition method as described above.
Compared with the prior art, in the speech recognition method, server and computer-readable storage medium provided by the present invention, the constructed acoustic model comprises a phoneme training model and a hybrid neural network model. The hybrid neural network model comprises a highway long short-term memory recurrent neural network HLSTM-RNN based on memory unit connection, a convolutional neural network CNN, a feed-forward neural network DNN and a hidden Markov model HMM. Speaker differences are reduced through the CNN-HMM, the time sequence information of the speech is captured through the RNN-LSTM-HMM, which uses the historical information within a sentence for context modeling, and different phonemes are distinguished through the DNN-HMM, which outputs by classification the recognized phonemes corresponding to the input speech information, so that the accuracy of phoneme recognition can be effectively improved. When an original speech signal is obtained, it is preprocessed to extract a valid speech portion, and acoustic features are extracted from the valid speech portion; the acoustic features are then input into the acoustic model, phoneme recognition is performed on them through the trained phoneme training model, and the recognition result is output to the trained hybrid neural network model based on memory unit connection. Finally, the text information corresponding to the speech information is output according to the received recognition result through the trained hybrid neural network model based on memory unit connection. Because the original speech signal is preprocessed, acoustic features are extracted and speech recognition is performed through the acoustic model, the accuracy of speech recognition is improved.
Drawings
FIG. 1 is a schematic diagram of an alternative hardware architecture for a server according to the present invention;
FIG. 2 is a schematic diagram of program modules of a first embodiment of the speech recognition system of the present invention;
FIG. 3 is a schematic diagram of program modules of a second embodiment of the speech recognition system of the present invention;
FIG. 4 is a flowchart illustrating a first embodiment of a speech recognition method according to the present invention;
FIG. 5 is a flowchart illustrating a speech recognition method according to a second embodiment of the present invention.
Reference numerals: see the component tables in the original document (images GDA0002150984740000051 and GDA0002150984740000061).
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that such a combination can be realized by a person skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and does not fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an alternative hardware architecture of the server 2. In this embodiment, the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to each other through a system bus. It is noted that fig. 1 only shows the server 2 with components 11-13, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 11 may be an internal storage unit of the server 2, such as a hard disk or a memory of the server 2. In other embodiments, the memory 11 may also be an external storage device of the server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the server 2. Of course, the memory 11 may also comprise both an internal storage unit of the server 2 and an external storage device thereof. In this embodiment, the memory 11 is generally used for storing an operating system installed in the server 2 and various types of application software, such as program codes of the speech recognition system 200. Furthermore, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the server 2, such as performing control and processing related to data interaction or communication with the terminal device 1. In this embodiment, the processor 12 is configured to operate the program codes stored in the memory 11 or process data, such as operating the speech recognition system 200.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing communication connection between the server 2 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the server 2 with one or more other electronic devices through a network, and establish a data transmission channel and a communication connection between the server 2 and the electronic devices.
The application environment and the hardware structure and function of the related devices of the various embodiments of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described application environment and related devices.
First, the present invention provides a speech recognition system 200.
Referring to FIG. 2, a program module diagram of a first embodiment of a speech recognition system 200 according to the present invention is shown.
In this embodiment, the speech recognition system 200 includes a series of computer program instructions stored on the memory 11 that, when executed by the processor 12, may perform speech recognition operations according to embodiments of the present invention. In some embodiments, the speech recognition system 200 may be divided into one or more modules based on the particular operations implemented by the portions of the computer program instructions. For example, in fig. 2, the speech recognition system 200 may be partitioned into a construction module 201, a processing module 202, an extraction module 203, a recognition module 204, and an output module 205. Wherein:
the building module 201 is configured to build an acoustic model, where the acoustic model includes a phoneme training model and a mixed neural network model based on memory unit connection.
Specifically, the server 2 constructs an acoustic model through the construction module 201. In this embodiment, the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection (a CLDNN-style automatic speech recognition architecture). The hybrid model integrates a highway long short-term memory recurrent neural network (HLSTM-RNN) based on memory unit connection, a convolutional neural network (CNN), a feed-forward neural network (DNN) and a hidden Markov model (HMM) into one deep hybrid neural network model. The CNN-HMM is used to reduce speaker differences (unlike voiceprint recognition, which focuses on the differences between speakers, speech recognition focuses on the spoken content itself, so speaker differences should be suppressed), the RNN-LSTM-HMM captures the time sequence information of the speech (the historical information within a sentence is used for context modeling), and the DNN-HMM then distinguishes different phonemes and outputs, by classification, the recognized phonemes corresponding to the input speech information. For example, if the user utters only a word pronounced "gong-shi", it is difficult to determine which word is meant, and each user's pronunciation differs to some extent; Mandarin has many homophones of "gong-shi" (formula, work, notice, attack). The RNN-LSTM-HMM captures the time sequence information of the speech; if the captured context is "calculate the maximum value by using a formula", the "gong-shi" in the sentence can be determined to be the word "formula" from the context information around it.
The processing module 202 is configured to, when an original voice signal is obtained, pre-process the voice signal to extract an effective voice portion.
Specifically, when acquiring an original speech signal, the server 2 preprocesses it through the processing module 202. In this embodiment, the processing module 202 first pre-emphasizes the original speech signal to boost its high-frequency part and flatten the spectrum. The processing module 202 then frames and windows the pre-emphasized speech signal to convert the non-stationary speech signal into short-time stationary signals. Further, the processing module 202 distinguishes speech from noise through endpoint detection, removes the noise in the short-time stationary signal, and extracts the valid speech portion. The frequency of the human voice is roughly 65-1100 Hz; in this embodiment, the processing module 202 may set a preset frequency range within 65-1100 Hz, remove sounds outside the preset frequency range (i.e. noise), and extract the short-time stationary signal within the preset frequency range.
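For clarity, the following is a minimal, illustrative sketch of this preprocessing chain in Python with NumPy. It is not the patented implementation: the 0.97 pre-emphasis coefficient, the 25 ms frame length, the 10 ms hop and the simple energy threshold standing in for full endpoint detection are all assumptions.

```python
# Illustrative preprocessing sketch (not the patented implementation).
# Assumes a float mono waveform at 16 kHz that is at least one frame long.
import numpy as np

def preprocess(signal, sample_rate=16000, pre_emph=0.97,
               frame_ms=25, hop_ms=10, energy_ratio=0.1):
    # Pre-emphasis: boost the high-frequency part to flatten the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Framing + Hamming window: treat each short frame as quasi-stationary.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # Crude energy-based endpoint detection: keep frames whose energy exceeds
    # a fraction of the maximum (a stand-in for the 65-1100 Hz band-limiting
    # described above, which a real system would do in the frequency domain).
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > energy_ratio * energy.max()]
```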
The extracting module 203 is configured to extract an acoustic feature from the valid speech portion as an input of the acoustic model.
Specifically, the server 2 extracts acoustic features from the valid speech portion through the extraction module 203. In this embodiment, the extraction module 203 first performs a Fourier transform on the valid speech portion to convert the speech signal in the time domain into an energy spectrum in the frequency domain. The extraction module 203 then passes the energy spectrum through a set of Mel-scale triangular filter banks to highlight the formant features of the speech. Further, the logarithmic energy output by each filter bank is calculated. After the logarithmic energy calculation, a discrete cosine transform is applied to the energy spectrum output by the triangular filter bank to obtain the MFCCs (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features.
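The same feature chain (Fourier transform, Mel-scale triangular filter bank, logarithmic energy, discrete cosine transform) can be sketched with the librosa library, which bundles these steps, including its own framing and windowing, into one call. The choice of librosa, the 13 coefficients and the 25 ms / 10 ms analysis window are assumptions, not details taken from the patent.

```python
# Minimal MFCC extraction sketch using librosa (an assumption; the patent
# does not name a library). The pipeline inside librosa.feature.mfcc mirrors
# the text: FFT -> Mel triangular filter bank -> log energy -> DCT.
import numpy as np
import librosa

def extract_mfcc(signal, sample_rate=16000, n_mfcc=13):
    # 13 cepstral coefficients matches the 13-dimensional MFCC mentioned
    # later in this description; n_fft/hop_length give 25 ms / 10 ms at 16 kHz.
    mfcc = librosa.feature.mfcc(y=signal.astype(np.float32), sr=sample_rate,
                                n_mfcc=n_mfcc, n_fft=400, hop_length=160)
    return mfcc.T  # shape: (n_frames, n_mfcc)
```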
The recognition module 204 performs phoneme recognition on the acoustic features through the trained phoneme training model and outputs the recognition result to the trained hybrid neural network model based on memory unit connection.
Specifically, the recognition module 204 performs phoneme recognition on the acoustic features through the trained phoneme training model; phoneme recognition mainly covers the recognition of the words and phrases in a sentence. During speech recognition, according to the occurrence probabilities of the states in the HMM, i.e. the similarity between different pronunciations, the path with the maximum occurrence probability is selected in the decoding network as the final output result.
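The selection of the maximum-probability path can be illustrated with a toy Viterbi routine over log-probabilities. A real decoder searches a weighted FST that also carries lexicon and language-model scores, so this sketch only shows the dynamic-programming core with made-up inputs.

```python
# Toy Viterbi sketch: pick the state path with the highest joint probability,
# as described above for the HMM decoding network. Inputs are assumed to be
# log-probabilities; nothing here is taken from the patented decoder.
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """log_init: (S,), log_trans: (S, S), log_obs: (T, S) frame log-likelihoods."""
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (prev state, current state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrace the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())
```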
The output module 205 is configured to output text information corresponding to the speech information according to the received recognition result through the trained memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the text information corresponding to the speech information according to the received recognition result through the trained hybrid neural network model based on memory unit connection. In this embodiment, all nodes of the hybrid neural network model based on memory unit connection are initialized with uniform random weights in the range [-0.05, 0.05], and the biases are initialized to 0. The neural network is trained with the cross-entropy criterion (CE, a training criterion that measures the divergence between the network output and the target) and optimized with truncated back-propagation through time (BPTT). Each segment of the model contains 20 frames of information, and each minibatch contains 40 spoken sentences. Furthermore, for the momentum factor (momentum, a variable that controls the acceleration of neural network training), the first epoch uses 0 and subsequent epochs use 0.9.
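A minimal sketch of these training settings, written in PyTorch as an assumption (the patent does not name a framework), is given below. The weight range, the momentum schedule, the 20-frame truncated-BPTT segments and the 40-utterance minibatch follow the text; the learning rate is a placeholder.

```python
# Sketch of the training settings described above. `model` is any nn.Module;
# only the values explicitly stated in the text are taken from it, the rest
# (learning rate, framework) are assumptions.
import torch
import torch.nn as nn

def init_weights(module):
    # Uniform random weights in [-0.05, 0.05], biases initialized to 0.
    for name, param in module.named_parameters():
        if "bias" in name:
            nn.init.zeros_(param)
        else:
            nn.init.uniform_(param, -0.05, 0.05)

def make_optimizer(model, epoch):
    # Momentum 0 for the first epoch, 0.9 afterwards, as stated above.
    momentum = 0.0 if epoch == 0 else 0.9
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=momentum)

criterion = nn.CrossEntropyLoss()  # cross-entropy (CE) training criterion

def bptt_chunks(features, labels, chunk=20):
    # Truncated BPTT: split each utterance into 20-frame segments; a loader
    # would then group 40 such utterances into one minibatch.
    for start in range(0, features.size(0), chunk):
        yield features[start:start + chunk], labels[start:start + chunk]
```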
Through the program modules 201-205, the speech recognition system 200 provided by the present invention first constructs an acoustic model, wherein the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection; then, when an original speech signal is obtained, the speech signal is preprocessed to extract a valid speech portion; further, acoustic features are extracted from the valid speech portion; next, the acoustic features are input into the acoustic model, phoneme recognition is performed on them through the trained phoneme training model, and the recognition result is output to the trained hybrid neural network model based on memory unit connection; finally, the text information corresponding to the speech information is output according to the received recognition result through the trained hybrid neural network model based on memory unit connection. Because the original speech signal is preprocessed, acoustic features are extracted and speech recognition is performed through the acoustic model, the accuracy of speech recognition is improved.
Further, based on the above-described first embodiment of the speech recognition system 200 of the present invention, a second embodiment of the present invention is proposed (as shown in fig. 3). In this embodiment, the speech recognition system 200 further comprises a comparison module 206 and a combination module 207, and the phoneme training model comprises a monophone model and a triphone model, wherein:
the comparing module 206 is configured to compare the similarity of different phone pronunciations according to the acoustic features through the single-phone model, and output a single-factor alignment result to the triple-phone model through the output module 205.
Generally, each person's pronunciation differs, or a local accent results in non-standard pronunciation. Therefore, in the present embodiment, the comparison module 206 compares, through the monophone model and according to the acoustic features, the similarity of different phoneme pronunciations with the dictionary phonemes (standard pronunciations), and outputs a monophone alignment result to the triphone model through the output module 205.
In this embodiment, the monophone model is trained as follows. First, the input acoustic features are normalized (by default, the variance is normalized). Next, an initialized HMM-GMM model and a decision tree are obtained from the acoustic feature data. A training network is then constructed: a phoneme-level FST network is built for decoding each sentence, and during training the feature sequences are repeatedly aligned to accumulate intermediate statistics. The HMM statistics are the occurrence counts of the arcs connecting two phonemes in the FST network, while the GMM statistics are the accumulated feature values and accumulated squared feature values corresponding to each pdf-id, which are used to update the two sufficient statistics of the GMM, the mean and the variance. The decoding network is trained by continuously updating the model in this way. Finally, forced alignment is performed again, either to decode an output result or for the next stage of model training.
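A greatly simplified analogue of this monophone stage can be sketched with the hmmlearn library: one small GMM-HMM per phone is re-estimated with EM on normalized features, and the Viterbi state sequence stands in for forced alignment. This is an assumption for illustration only; the pipeline described above (phoneme-level FST construction, arc statistics, pdf-id statistics) resembles Kaldi-style training and is far more involved.

```python
# Conceptual monophone HMM-GMM sketch using hmmlearn (an assumption; not the
# FST-based training described above). hmmlearn's GMMHMM is ergodic by
# default, whereas a real acoustic model would use a left-to-right topology.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_monophones(frames_per_phone, n_states=3, n_mix=4):
    """frames_per_phone: dict mapping phone -> (N, 13) array of MFCC frames."""
    models = {}
    for phone, feats in frames_per_phone.items():
        # Mean/variance normalization of the input features (the text
        # normalizes the variance by default).
        feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type="diag", n_iter=10)
        hmm.fit(feats)            # EM re-estimation ("repeated training")
        models[phone] = hmm
    return models

def align(model, feats):
    # Forced-alignment analogue: most likely HMM state for each frame.
    return model.predict(feats)
```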
The combination module 207 is configured to take into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and to output a forced phoneme alignment result.
Specifically, the triphone model aligns the phonemes one by one while taking into account the influence of the phonemes immediately before and after the phoneme currently being aligned, so that a more accurate alignment and a better recognition result are obtained. For example, Mandarin has different characters that share the same syllable (quiet, clean, competitive, all pronounced "jing") as well as homophones such as "gong-shi" (formula, work, notice, attack). Through triphone training, the influence of the preceding and following phonemes of the current phoneme, i.e. its context, can be taken into account, making the recognition of the current phoneme more accurate. For example, suppose the user says "zen-me-li-yong-gong-shi-zheng-ming-deng-shi-cheng-li?" (how to use a formula to prove that the equation holds). When the phoneme currently being recognized is "gong-shi", which has many homophones, "gong-shi" can be determined to mean "formula" from the context provided by the related phonemes "zheng-ming" (prove) and "deng-shi" (equation).
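The idea of context-dependent units can be shown with a few lines that re-label each phone together with its left and right neighbours; real systems additionally cluster these contexts with a decision tree, which is omitted here, and the "sil" boundary symbol is an assumption.

```python
# Tiny illustration of context-dependent (triphone) units: each phone is
# re-labelled with its left and right neighbours, which is how the triphone
# model obtains the contextual information described above.
def to_triphones(phones, boundary="sil"):
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Example: the syllables of "gong-shi" as a short phone sequence.
print(to_triphones(["g", "ong", "sh", "i"]))
# [('sil', 'g', 'ong'), ('g', 'ong', 'sh'), ('ong', 'sh', 'i'), ('sh', 'i', 'sil')]
```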
In this embodiment, the triphone model is trained on delta + delta-delta feature transforms, in which first- and second-order delta features are appended to the original MFCC features (delta is the time derivative of the MFCC features, and delta-delta is the second derivative obtained in the same way). The original MFCC feature is 13-dimensional; after the delta + delta-delta features are added, the input feature becomes 39-dimensional.
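The delta + delta-delta expansion can be sketched as follows; simple frame-wise differences via numpy.gradient are used as an assumption, whereas production front ends usually compute the derivatives over a small regression window.

```python
# Sketch of the delta + delta-delta expansion: first- and second-order time
# derivatives are appended to the 13-dimensional MFCCs, giving the
# 39-dimensional input mentioned above.
import numpy as np

def add_deltas(mfcc):
    """mfcc: (T, 13) -> (T, 39)"""
    delta = np.gradient(mfcc, axis=0)      # first derivative over time
    delta2 = np.gradient(delta, axis=0)    # second derivative
    return np.concatenate([mfcc, delta, delta2], axis=1)
```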
The output module 205 is further configured to output the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the forced phoneme alignment result to the memory unit connection-based hybrid neural network model through the output module 205.
In this embodiment, the hybrid neural network model based on memory unit connection processes the forced phoneme alignment result as follows. The forced phoneme alignment result is first output to a CNN model. The CNN model has a convolutional layer containing 256 convolution kernels (each a 1 × 8 matrix), and each kernel generates one feature map for extracting a different feature. The CNN model also has a non-overlapping max-pooling layer whose window size and stride (the step by which the window is moved over the input) are both 3; pooling reduces the dimensionality of the convolutional layer's output. For example, with an 83-dimensional feature vector as input, the parameters become 83 × 256 after the convolutional layer and (83/3) × 256 after the max-pooling layer. Although this is a reduction to one third, the model parameters are still too large for speech recognition, so a projection layer is connected after the max-pooling layer to continue the dimensionality reduction, reducing (83/3) × 256 to 256.
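A minimal PyTorch sketch of this CNN front end is shown below. The 256 kernels of size 1 × 8, the non-overlapping pooling of size 3 and the projection to 256 dimensions follow the text; the absence of padding, the ReLU non-linearity and the exact layout of the 83-dimensional input are assumptions, so the intermediate widths differ slightly from the 83 and 83/3 quoted above.

```python
# Illustrative CNN front end: conv (256 kernels, 1x8) -> non-overlapping
# max-pooling (size 3) -> linear projection down to 256, per frame.
import torch
import torch.nn as nn

class CNNFrontEnd(nn.Module):
    def __init__(self, feat_dim=83, n_kernels=256, proj_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=(1, 8))      # 256 feature maps
        self.pool = nn.MaxPool2d(kernel_size=(1, 3), stride=(1, 3))  # non-overlapping pooling
        pooled = (feat_dim - 8 + 1) // 3                             # width after conv + pool
        self.proj = nn.Linear(n_kernels * pooled, proj_dim)          # projection layer

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        x = x.unsqueeze(1)                      # (batch, 1, frames, feat_dim)
        x = torch.relu(self.conv(x))
        x = self.pool(x)                        # (batch, 256, frames, pooled)
        x = x.permute(0, 2, 1, 3).flatten(2)    # (batch, frames, 256 * pooled)
        return self.proj(x)                     # (batch, frames, 256)

frames = torch.randn(4, 20, 83)                 # 4 chunks of 20 frames each
print(CNNFrontEnd()(frames).shape)              # torch.Size([4, 20, 256])
```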
Before the CNN output is sent to the RNN-LSTM based on memory unit connection, there is a connection layer (to increase the number of outputs). This is because the feature vectors input to a recurrent neural network are generally spliced from neighbouring frames; for example, 5 frames before and after the current frame are typically spliced in temporal order to train the sequence model RNN, whereas the CNN takes its input frame by frame, so the output of the CNN needs to be adapted before being input to the LSTM-RNN. The RNN-LSTM model has 3 LSTM layers, each with 1024 neuron nodes followed by a projection layer with 512 nodes (also for dimensionality reduction). Finally, the output of the last LSTM layer is input into a fully-connected feed-forward neural network (DNN) model with two layers of 1024 hidden nodes each; the activation function is the rectified linear function f(x) = max(0, x), and the DNN result is then passed through a softmax layer for classification and decision.
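The recurrent and feed-forward back end can be sketched in the same way. The three 1024-cell LSTM layers with 512-node projections, the two 1024-node ReLU layers and the softmax output follow the text; the number of output targets, the log-softmax formulation and the use of PyTorch's proj_size option to realise the per-layer projection are assumptions.

```python
# Illustrative LSTM + DNN back end consuming the 256-dim output of the CNN
# sketch above: 3 LSTM layers of 1024 cells projected to 512, two 1024-node
# ReLU layers, then a softmax over the targets. `n_targets` is a placeholder.
import torch
import torch.nn as nn

class LSTMDNNBackEnd(nn.Module):
    def __init__(self, in_dim=256, n_targets=3000):
        super().__init__()
        # 3 LSTM layers, 1024 cells each, with a 512-dim projection layer.
        self.lstm = nn.LSTM(in_dim, 1024, num_layers=3,
                            proj_size=512, batch_first=True)
        self.dnn = nn.Sequential(               # two fully-connected ReLU layers
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_targets),
        )

    def forward(self, x):                       # x: (batch, frames, 256)
        x, _ = self.lstm(x)                     # (batch, frames, 512)
        return torch.log_softmax(self.dnn(x), dim=-1)  # per-frame class scores
```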
Through the program modules 206-207, the speech recognition system 200 of the present invention can align the phoneme pronunciations through the monophone model and then force-align the phonemes through the triphone model in combination with the context, thereby improving the accuracy of speech recognition.
In addition, the invention also provides a voice recognition method.
Fig. 4 is a schematic flow chart of the speech recognition method according to the first embodiment of the present invention. In this embodiment, the execution order of the steps in the flowchart shown in fig. 4 may be changed and some steps may be omitted according to different requirements.
Step S301, an acoustic model is constructed, wherein the acoustic model comprises a phoneme training model and a mixed neural network model based on memory unit connection.
Specifically, the server 2 constructs an acoustic model. In this embodiment, the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection (a CLDNN-style automatic speech recognition architecture). The hybrid model integrates a highway long short-term memory recurrent neural network (HLSTM-RNN) based on memory unit connection, a convolutional neural network (CNN), a feed-forward neural network (DNN) and a hidden Markov model (HMM) into one deep hybrid neural network model. The CNN-HMM is used to reduce speaker differences (unlike voiceprint recognition, which focuses on the differences between speakers, speech recognition focuses on the spoken content itself, so speaker differences should be suppressed), the RNN-LSTM-HMM captures the time sequence information of the speech (the historical information within a sentence is used for context modeling), and the DNN-HMM then distinguishes different phonemes and outputs, by classification, the recognized phonemes corresponding to the input speech information. For example, if the user utters only a word pronounced "gong-shi", it is difficult to determine which word is meant, and each user's pronunciation differs to some extent; Mandarin has many homophones of "gong-shi" (formula, work, notice, attack). The RNN-LSTM-HMM captures the time sequence information of the speech; if the captured context is "calculate the maximum value by using a formula", the "gong-shi" in the sentence can be determined to be the word "formula" from the context information around it.
Step S302, when an original voice signal is obtained, preprocessing is carried out on the voice signal to extract an effective voice part.
Specifically, when acquiring an original speech signal, the server 2 preprocesses it. In this embodiment, the server 2 first pre-emphasizes the original speech signal to boost its high-frequency part and flatten the spectrum. The pre-emphasized speech signal is then framed and windowed to convert the non-stationary speech signal into short-time stationary signals. Further, speech and noise are distinguished through endpoint detection, the noise in the short-time stationary signal is removed, and the valid speech portion is extracted. The frequency of the human voice is roughly 65-1100 Hz; in this embodiment, the server 2 may set a preset frequency range within 65-1100 Hz, remove sounds outside the preset frequency range (i.e. noise), and extract the short-time stationary signal within the preset frequency range.
Step S303, extracting acoustic features from the valid speech part as input of the acoustic model.
Specifically, the server 2 extracts acoustic features from the valid speech portion. In this embodiment, the server 2 first performs a Fourier transform on the valid speech portion to convert the speech signal in the time domain into an energy spectrum in the frequency domain. The server 2 then passes the energy spectrum through a set of Mel-scale triangular filter banks to highlight the formant features of the speech. Further, the logarithmic energy output by each filter bank is calculated. After the logarithmic energy calculation, a discrete cosine transform is applied to the energy spectrum output by the triangular filter bank to obtain the MFCCs (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features.
Step S304, performing phoneme recognition on the acoustic features through the trained phoneme training model, and outputting the recognition result to the trained hybrid neural network model based on memory unit connection.
Specifically, the server 2 performs phoneme recognition on the acoustic features through the trained phoneme training model; phoneme recognition mainly covers the recognition of the words and phrases in a sentence. During speech recognition, according to the occurrence probabilities of the states in the HMM, i.e. the similarity between different pronunciations, the path with the maximum occurrence probability is selected in the decoding network as the final output result.
Step S305, outputting text information corresponding to the voice information according to the received recognition result through the trained memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the text information corresponding to the speech information according to the received recognition result through the trained hybrid neural network model based on memory unit connection. In this embodiment, all nodes of the hybrid neural network model based on memory unit connection are initialized with uniform random weights in the range [-0.05, 0.05], and the biases are initialized to 0. The neural network is trained with the cross-entropy criterion (CE, a training criterion that measures the divergence between the network output and the target) and optimized with truncated back-propagation through time (BPTT). Each segment of the model contains 20 frames of information, and each minibatch contains 40 spoken sentences. Furthermore, for the momentum factor (momentum, a variable that controls the acceleration of neural network training), the first epoch uses 0 and subsequent epochs use 0.9.
Through steps S301-S305, the speech recognition method provided by the present invention first constructs an acoustic model, wherein the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection; then, when an original speech signal is obtained, the speech signal is preprocessed to extract a valid speech portion; further, acoustic features are extracted from the valid speech portion; next, the acoustic features are input into the acoustic model, phoneme recognition is performed on them through the trained phoneme training model, and the recognition result is output to the trained hybrid neural network model based on memory unit connection; finally, the text information corresponding to the speech information is output according to the received recognition result through the trained hybrid neural network model based on memory unit connection. Because the original speech signal is preprocessed, acoustic features are extracted and speech recognition is performed through the acoustic model, the accuracy of speech recognition is improved.
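To make the flow of steps S301-S305 concrete, the sketch below strings together the illustrative functions introduced earlier in this description. It is hypothetical glue code: the models would first have to be trained and loaded, the preprocessing of step S302 would trim the signal before feature extraction, and decode_states_to_text is a placeholder for the HMM decoding network plus lexicon lookup.

```python
# Hypothetical end-to-end driver reusing the earlier sketches (extract_mfcc,
# add_deltas, CNNFrontEnd, LSTMDNNBackEnd). Not the patented implementation.
import numpy as np
import torch

def recognize(signal, sample_rate, cnn, backend, decode_states_to_text):
    feats = add_deltas(extract_mfcc(signal, sample_rate))         # step S303: 39-dim features
    x = torch.from_numpy(feats.astype(np.float32)).unsqueeze(0)   # (1, frames, 39)
    with torch.no_grad():
        scores = backend(cnn(x))                                  # step S304: hybrid network scores
    states = scores.argmax(dim=-1).squeeze(0).tolist()            # most likely target per frame
    return decode_states_to_text(states)                          # step S305: text output

# Usage (with a CNN built for 39-dim features and trained weights loaded):
#   text = recognize(waveform, 16000, CNNFrontEnd(feat_dim=39),
#                    LSTMDNNBackEnd(in_dim=256), my_decoder)
```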
Further, based on the above-described first embodiment of the speech recognition method of the present invention, a second embodiment of the speech recognition method of the present invention is proposed.
Fig. 5 is a flow chart of a speech recognition method according to a second embodiment of the present invention. In this embodiment, the phoneme training model includes a monophone model and a triphone model, and the step of performing phoneme recognition on the acoustic features through the trained phoneme training model and outputting the recognition result to the trained hybrid neural network model based on memory unit connection specifically includes the following steps:
step S401, comparing the similarity of different phone pronunciations according to the acoustic features through the single phone model, and outputting a single-factor alignment result to the three-phone model through the output module 205.
Generally, each person's pronunciation differs, or a local accent results in non-standard pronunciation. Therefore, in the present embodiment, the server 2 compares, through the monophone model and according to the acoustic features, the similarity of different phoneme pronunciations with the dictionary phonemes (standard pronunciations), and outputs a monophone alignment result to the triphone model.
In this embodiment, the monophone model is trained as follows. First, the input acoustic features are normalized (by default, the variance is normalized). Next, an initialized HMM-GMM model and a decision tree are obtained from the acoustic feature data. A training network is then constructed: a phoneme-level FST network is built for decoding each sentence, and during training the feature sequences are repeatedly aligned to accumulate intermediate statistics. The HMM statistics are the occurrence counts of the arcs connecting two phonemes in the FST network, while the GMM statistics are the accumulated feature values and accumulated squared feature values corresponding to each pdf-id, which are used to update the two sufficient statistics of the GMM, the mean and the variance. The decoding network is trained by continuously updating the model in this way. Finally, forced alignment is performed again, either to decode an output result or for the next stage of model training.
Step S402, taking into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and outputting a forced phoneme alignment result.
Specifically, the triphone model aligns the phonemes one by one while taking into account the influence of the phonemes immediately before and after the phoneme currently being aligned, so that a more accurate alignment and a better recognition result are obtained. For example, Mandarin has different characters that share the same syllable (quiet, clean, competitive, all pronounced "jing") as well as homophones such as "gong-shi" (formula, work, notice, attack). Through triphone training, the influence of the preceding and following phonemes of the current phoneme, i.e. its context, can be taken into account, making the recognition of the current phoneme more accurate. For example, suppose the user says "zen-me-li-yong-gong-shi-zheng-ming-deng-shi-cheng-li?" (how to use a formula to prove that the equation holds). When the phoneme currently being recognized is "gong-shi", which has many homophones, "gong-shi" can be determined to mean "formula" from the context provided by the related phonemes "zheng-ming" (prove) and "deng-shi" (equation).
In this embodiment, the triphone model is trained on delta + delta-delta feature transforms, in which first- and second-order delta features are appended to the original MFCC features (delta is the time derivative of the MFCC features, and delta-delta is the second derivative obtained in the same way). The original MFCC feature is 13-dimensional; after the delta + delta-delta features are added, the input feature becomes 39-dimensional.
Step S403, outputting the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
In this embodiment, the hybrid neural network model based on memory unit connection processes the forced phoneme alignment result as follows. The forced phoneme alignment result is first output to a CNN model. The CNN model has a convolutional layer containing 256 convolution kernels (each a 1 × 8 matrix), and each kernel generates one feature map for extracting a different feature. The CNN model also has a non-overlapping max-pooling layer whose window size and stride (the step by which the window is moved over the input) are both 3; pooling reduces the dimensionality of the convolutional layer's output. For example, with an 83-dimensional feature vector as input, the parameters become 83 × 256 after the convolutional layer and (83/3) × 256 after the max-pooling layer. Although this is a reduction to one third, the model parameters are still too large for speech recognition, so a projection layer is connected after the max-pooling layer to continue the dimensionality reduction, reducing (83/3) × 256 to 256.
Before the CNN output is sent to the RNN-LSTM based on memory unit connection, there is a connection layer (to increase the number of outputs). This is because the feature vectors input to a recurrent neural network are generally spliced from neighbouring frames; for example, 5 frames before and after the current frame are typically spliced in temporal order to train the sequence model RNN, whereas the CNN takes its input frame by frame, so the output of the CNN needs to be adapted before being input to the LSTM-RNN. The RNN-LSTM model has 3 LSTM layers, each with 1024 neuron nodes followed by a projection layer with 512 nodes (also for dimensionality reduction). Finally, the output of the last LSTM layer is input into a fully-connected feed-forward neural network (DNN) model with two layers of 1024 hidden nodes each; the activation function is the rectified linear function f(x) = max(0, x), and the DNN result is then passed through a softmax layer for classification and decision.
Through steps S401 to S403, the speech recognition method provided by the present invention can align the phoneme pronunciations through the monophone model and then force-align the phonemes through the triphone model in combination with the context, thereby improving the accuracy of speech recognition.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech recognition method applied to a server is characterized by comprising the following steps:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a mixed neural network model, the mixed neural network model comprises a highway long short-term memory recurrent neural network HLSTM-RNN, a convolutional neural network CNN, a feedforward neural network DNN and a hidden Markov model HMM which are connected based on memory units, speaker differences are reduced through the CNN-HMM, time sequence information of the speech is captured through the RNN-LSTM-HMM, context modeling is carried out by utilizing historical information in a sentence, different phonemes are distinguished through the DNN-HMM, and the recognized phonemes corresponding to input speech information are output by classification;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through a trained phoneme training model, selecting a path with the maximum occurrence probability in a decoding network as a recognition result according to the occurrence probability of a state in a Hidden Markov Model (HMM) in the mixed neural network model, outputting the recognition result to an RNN-LSTM model based on memory unit connection in the mixed neural network model, and inputting the output of the last layer of LSTM to the feedforward neural network (DNN);
and outputting text information corresponding to the voice information.
2. The speech recognition method according to claim 1, wherein the step of preprocessing the speech signal to extract an effective speech part when the original speech signal is acquired specifically comprises:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise of the short-time stationary signal, and extracting an effective voice part, wherein the effective voice part is the short-time stationary signal in a preset frequency.
3. The speech recognition method of claim 2, wherein the step of extracting acoustic features from the valid speech portion comprises:
fourier transforming the effective speech portion to convert the speech portion in the time domain to an energy spectrum in the frequency domain;
according to the energy spectrum, highlighting formant features of the voice part through a set of Mel-scale triangular filter banks;
and obtaining acoustic characteristics by performing discrete cosine transform on the energy spectrum output by the triangular filter bank.
4. The speech recognition method of any one of claims 1-3, wherein the phoneme training models comprise a monophonic model and a triphone model, and the selecting the path with the highest probability of occurrence in the decoding network as the recognition result is output to the RNN-LSTM model based on the memory unit connections in the hybrid neural network model further comprises:
comparing the similarity of different phoneme pronunciations according to the acoustic characteristics through the single-phoneme model, and outputting an alignment result to the triphone model;
combining the influence of front and rear related phonemes of the current phoneme through the triphone model, and outputting a forced phoneme alignment result to a CNN model in the mixed neural network model;
and outputting the output result of the CNN model to the RNN-LSTM model.
5. The speech recognition method of claim 4, wherein the acoustic feature is MFCC (Mel frequency cepstrum coefficient).
6. A server, comprising a memory, a processor, the memory having stored thereon a speech recognition system operable on the processor, the speech recognition system when executed by the processor performing the steps of:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a mixed neural network model based on memory unit connection, the mixed neural network model comprises a highway long short-term memory recurrent neural network HLSTM-RNN based on memory unit connection, a convolutional neural network CNN, a feedforward neural network DNN and a hidden Markov model HMM, speaker differences are reduced through the CNN-HMM, time sequence information of the speech is captured through the RNN-LSTM-HMM, context modeling is carried out by utilizing historical information in a sentence, different phonemes are distinguished through the DNN-HMM, and the recognized phonemes corresponding to input speech information are output by classification;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through a trained phoneme training model, selecting a path with the maximum occurrence probability in a decoding network as a recognition result according to the occurrence probability of a state in a Hidden Markov Model (HMM) in the mixed neural network model, outputting the path with the maximum occurrence probability to an RNN-LSTM model based on memory unit connection in the mixed neural network model, and inputting the output of the last layer of LSTM to the feedforward neural network DNN;
and outputting text information corresponding to the voice information.
7. The server according to claim 6, wherein the step of preprocessing the speech signal to extract an effective speech part when the original speech signal is acquired specifically includes:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise of the short-time stationary signal, and extracting an effective voice part, wherein the effective voice part is the short-time stationary signal in a preset frequency.
8. The server according to claim 6, wherein the step of extracting acoustic features from the valid speech portion comprises:
performing a Fourier transform on the valid speech portion to convert the speech portion in the time domain into an energy spectrum in the frequency domain;
highlighting formant features of the speech portion by passing the energy spectrum through a set of Mel-scale triangular filter banks;
and obtaining an acoustic feature by performing a discrete cosine transform on the energy spectrum output by the triangular filter bank, wherein the acoustic feature is a Mel-frequency cepstral coefficient (MFCC) feature.
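A minimal NumPy/SciPy sketch of this feature-extraction chain, continuing from the windowed frames of the preprocessing sketch above: a power spectrum per frame, a bank of Mel-scale triangular filters, and a DCT of the log filter-bank energies that yields the MFCC features. The 512-point FFT, the 26 filters and the 13 retained coefficients are conventional assumptions:

```python
import numpy as np
from scipy.fftpack import dct


def mel_filterbank(num_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale.
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank


def mfcc(frames, sample_rate=16000, n_fft=512, num_filters=26, num_ceps=13):
    # Power spectrum of each windowed frame (time domain -> frequency domain).
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Mel-scale triangular filter-bank energies, floored to avoid log(0).
    energies = np.maximum(power @ mel_filterbank(num_filters, n_fft, sample_rate).T, 1e-10)
    # DCT of the log energies gives the cepstral coefficients.
    return dct(np.log(energies), type=2, axis=1, norm="ortho")[:, :num_ceps]


if __name__ == "__main__":
    frames = np.random.randn(98, 400)        # windowed frames from the sketch above
    print(mfcc(frames).shape)                 # (98, 13)
```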
9. The server according to any one of claims 7-8, wherein the phoneme training model comprises a monophone model and a triphone model, and the step of selecting the path with the maximum occurrence probability in the decoding network as the recognition result and outputting it to the RNN-LSTM model based on memory unit connections in the hybrid neural network model further comprises:
comparing the similarity of pronunciations of different phonemes according to the acoustic features through the monophone model, and outputting an alignment result to the triphone model;
combining, through the triphone model, the influence of the preceding and following context phonemes of the current phoneme, and outputting a forced phoneme alignment result to a CNN model in the hybrid neural network model;
and outputting the result of the CNN model to the RNN-LSTM model based on memory unit connections.
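A minimal sketch of the path-selection step referenced in the preamble of this claim: a Viterbi search over per-frame HMM state log probabilities that returns the state path with the maximum occurrence probability. The transition matrix, priors and random frame scores are illustrative stand-ins for a real decoding network:

```python
import numpy as np


def viterbi(frame_log_probs, log_trans, log_prior):
    """Return the state path with the maximum total log probability.

    frame_log_probs: (num_frames, num_states) per-frame state log probabilities
                     (e.g. from the hybrid neural network model).
    log_trans:       (num_states, num_states) HMM transition log probabilities.
    log_prior:       (num_states,) initial state log probabilities.
    """
    num_frames, num_states = frame_log_probs.shape
    score = log_prior + frame_log_probs[0]
    backptr = np.zeros((num_frames, num_states), dtype=int)

    for t in range(1, num_frames):
        # Best previous state for every current state.
        candidates = score[:, None] + log_trans            # (prev, cur)
        backptr[t] = np.argmax(candidates, axis=0)
        score = candidates[backptr[t], np.arange(num_states)] + frame_log_probs[t]

    # Trace back the highest-probability path.
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = np.log(rng.dirichlet(np.ones(4), size=10))   # 10 frames, 4 states
    trans = np.log(rng.dirichlet(np.ones(4), size=4))      # row-stochastic transitions
    prior = np.log(np.full(4, 0.25))
    print(viterbi(frames, trans, prior))
```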
10. A computer-readable storage medium storing a speech recognition system executable by at least one processor to cause the at least one processor to perform the steps of the speech recognition method according to any one of claims 1-5.
CN201810227474.8A 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium Active CN108564940B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810227474.8A CN108564940B (en) 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium
PCT/CN2018/102204 WO2019179034A1 (en) 2018-03-20 2018-08-24 Speech recognition method, server and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810227474.8A CN108564940B (en) 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN108564940A CN108564940A (en) 2018-09-21
CN108564940B true CN108564940B (en) 2020-04-28

Family

ID=63531769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810227474.8A Active CN108564940B (en) 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN108564940B (en)
WO (1) WO2019179034A1 (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147775A (en) * 2018-10-18 2019-01-04 深圳供电局有限公司 A kind of audio recognition method neural network based and device
CN111210805A (en) * 2018-11-05 2020-05-29 北京嘀嘀无限科技发展有限公司 Language identification model training method and device and language identification method and device
CN109376264A (en) * 2018-11-09 2019-02-22 广州势必可赢网络科技有限公司 A kind of audio-frequency detection, device, equipment and computer readable storage medium
CN111191668B (en) * 2018-11-15 2023-04-28 零氪科技(北京)有限公司 Method for identifying disease content in medical record text
CN109525787B (en) * 2018-12-13 2021-03-16 南京邮电大学 Live scene oriented real-time subtitle translation and system implementation method
CN109616111B (en) * 2018-12-24 2023-03-14 北京恒泰实达科技股份有限公司 Scene interaction control method based on voice recognition
CN111402870B (en) * 2019-01-02 2023-08-15 中国移动通信有限公司研究院 Voice recognition method, device and equipment
CN109448726A (en) * 2019-01-14 2019-03-08 李庆湧 A kind of method of adjustment and system of voice control accuracy rate
CN109767765A (en) * 2019-01-17 2019-05-17 平安科技(深圳)有限公司 Talk about art matching process and device, storage medium, computer equipment
CN111489745A (en) * 2019-01-28 2020-08-04 上海菲碧文化传媒有限公司 Chinese speech recognition system applied to artificial intelligence
CN109767759B (en) * 2019-02-14 2020-12-22 重庆邮电大学 Method for establishing CLDNN structure applied to end-to-end speech recognition
CN110111774A (en) * 2019-05-13 2019-08-09 广西电网有限责任公司南宁供电局 Robot voice recognition methods and device
CN110189749B (en) * 2019-06-06 2021-03-19 四川大学 Automatic voice keyword recognition method
CN110211591B (en) * 2019-06-24 2021-12-21 卓尔智联(武汉)研究院有限公司 Interview data analysis method based on emotion classification, computer device and medium
CN111127699A (en) * 2019-11-25 2020-05-08 爱驰汽车有限公司 Method, system, equipment and medium for automatically recording automobile defect data
CN112990208A (en) * 2019-12-12 2021-06-18 搜狗(杭州)智能科技有限公司 Text recognition method and device
CN110970036B (en) * 2019-12-24 2022-07-12 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment
CN113270091B (en) * 2020-02-14 2024-04-16 声音猎手公司 Audio processing system and method
CN113360869A (en) * 2020-03-04 2021-09-07 北京嘉诚至盛科技有限公司 Method for starting application, electronic equipment and computer readable medium
CN111354344B (en) * 2020-03-09 2023-08-22 第四范式(北京)技术有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN111402891B (en) * 2020-03-23 2023-08-11 抖音视界有限公司 Speech recognition method, device, equipment and storage medium
CN113571054B (en) * 2020-04-28 2023-08-15 中国移动通信集团浙江有限公司 Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN111798841B (en) * 2020-05-13 2023-01-03 厦门快商通科技股份有限公司 Acoustic model training method and system, mobile terminal and storage medium
CN111951796B (en) * 2020-08-19 2024-03-12 北京达佳互联信息技术有限公司 Speech recognition method and device, electronic equipment and storage medium
CN112216270B (en) * 2020-10-09 2024-02-06 携程计算机技术(上海)有限公司 Speech phoneme recognition method and system, electronic equipment and storage medium
CN112651429B (en) * 2020-12-09 2022-07-12 歌尔股份有限公司 Audio signal time sequence alignment method and device
CN112614485A (en) * 2020-12-30 2021-04-06 竹间智能科技(上海)有限公司 Recognition model construction method, voice recognition method, electronic device, and storage medium
CN112885370A (en) * 2021-01-11 2021-06-01 广州欢城文化传媒有限公司 Method and device for detecting validity of sound card
CN113299270A (en) * 2021-05-20 2021-08-24 平安科技(深圳)有限公司 Method, device and equipment for generating voice synthesis system and storage medium
CN113327616A (en) * 2021-06-02 2021-08-31 广东电网有限责任公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN113658599A (en) * 2021-08-18 2021-11-16 平安普惠企业管理有限公司 Conference record generation method, device, equipment and medium based on voice recognition
CN113870848B (en) * 2021-12-02 2022-04-26 深圳市友杰智新科技有限公司 Method and device for constructing voice modeling unit and computer equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10235991B2 (en) * 2016-08-09 2019-03-19 Apptek, Inc. Hybrid phoneme, diphone, morpheme, and word-level deep neural networks
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107680582B (en) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, voice recognition method, device, equipment and medium
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium

Also Published As

Publication number Publication date
CN108564940A (en) 2018-09-21
WO2019179034A1 (en) 2019-09-26

Similar Documents

Publication Publication Date Title
CN108564940B (en) Speech recognition method, server and computer-readable storage medium
US8930196B2 (en) System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands
Ghai et al. Literature review on automatic speech recognition
US8762142B2 (en) Multi-stage speech recognition apparatus and method
US9165555B2 (en) Low latency real-time vocal tract length normalization
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
WO2003010753A1 (en) Pattern recognition using an observable operator model
CN106548775B (en) Voice recognition method and system
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
EP1675102A2 (en) Method for extracting feature vectors for speech recognition
US20100057462A1 (en) Speech Recognition
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
Boite et al. A new approach towards keyword spotting.
Sinha et al. On the use of pitch normalization for improving children's speech recognition
JP3535292B2 (en) Speech recognition system
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
CN114171009A (en) Voice recognition method, device, equipment and storage medium for target equipment
CN112216270A (en) Method and system for recognizing speech phonemes, electronic equipment and storage medium
Sai et al. Enhancing pitch robustness of speech recognition system through spectral smoothing
Khalifa et al. Statistical modeling for speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant