CN110335591A - A kind of parameter management method, device, machine readable media and equipment - Google Patents
- Publication number
- CN110335591A CN110335591A CN201910600722.3A CN201910600722A CN110335591A CN 110335591 A CN110335591 A CN 110335591A CN 201910600722 A CN201910600722 A CN 201910600722A CN 110335591 A CN110335591 A CN 110335591A
- Authority
- CN
- China
- Prior art keywords
- parameter management
- phonetic feature
- layer
- multidimensional
- pfsmn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000007726 management method Methods 0.000 title claims abstract description 50
- 238000012545 processing Methods 0.000 claims abstract description 77
- 238000012549 training Methods 0.000 claims abstract description 62
- 238000000034 method Methods 0.000 claims abstract description 41
- 239000000284 extract Substances 0.000 claims abstract description 16
- 239000010410 layer Substances 0.000 claims description 109
- 230000015654 memory Effects 0.000 claims description 78
- 238000013527 convolutional neural network Methods 0.000 claims description 37
- 230000000875 corresponding effect Effects 0.000 claims description 20
- 230000008569 process Effects 0.000 claims description 16
- 238000013507 mapping Methods 0.000 claims description 12
- 230000002596 correlated effect Effects 0.000 claims description 6
- 239000011229 interlayer Substances 0.000 claims description 6
- 230000000694 effects Effects 0.000 abstract description 5
- 238000013528 artificial neural network Methods 0.000 description 9
- 238000004891 communication Methods 0.000 description 9
- 238000010586 diagram Methods 0.000 description 6
- 238000005070 sampling Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 238000012986 modification Methods 0.000 description 4
- 230000005236 sound signal Effects 0.000 description 3
- 230000000712 assembly Effects 0.000 description 2
- 238000000429 assembly Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 230000007774 longterm Effects 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 230000000306 recurrent effect Effects 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 230000009466 transformation Effects 0.000 description 2
- 230000018199 S phase Effects 0.000 description 1
- 125000002015 acyclic group Chemical group 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000004075 alteration Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 239000012141 concentrate Substances 0.000 description 1
- 230000001276 controlling effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 230000002035 prolonged effect Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/06—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
Disclosed are a parameter management method and device. The parameter management method includes: obtaining voice data and extracting a first phonetic feature from the voice data; performing multidimensional processing on the first phonetic feature and outputting a multidimensional processing result, wherein the multidimensional processing includes at least one of: noise robustness processing, voice context processing; and training the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information. The present invention can learn additional information that other networks cannot learn, and can achieve good results even in noisy and other complex environments.
Description
Technical field
The invention belongs to the field of speech recognition, and in particular relates to a parameter management method, device, machine readable medium and equipment.
Background technique
A speech recognition system mainly consists of an acoustic model, a language model and a decoder. The acoustic model is mainly responsible for converting the original audio into a pronunciation sequence; the language model handles conversions such as text and word grammar; the decoder handles problems such as computation and coding formats during the conversion. Clearly, the acoustic model plays a vital role in a speech recognition system. Traditional acoustic models are mainly GMM (Gaussian Mixture Model), DNN (Deep Neural Network), CNN (Convolutional Neural Network) or LSTM (Long Short-Term Memory network). But each has its own shortcomings: a DNN captures less information, a CNN cannot capture information over time, and an LSTM is difficult to train and not robust to noise, leading to the technical problems of low training speed and low recognition rate.
Summary of the invention
In view of the above deficiencies of the prior art, the present invention provides a parameter management method, device, machine readable medium and equipment, for solving the defects of low training speed and low recognition rate in the prior art.
In order to achieve the above objects and other related objects, the present invention provides a parameter management method, the parameter management method comprising:
obtaining voice data and extracting a first phonetic feature from the voice data;
performing multidimensional processing on the first phonetic feature and outputting a multidimensional processing result; wherein the multidimensional processing includes at least one of: noise robustness processing, voice context processing;
training the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information.
Optionally, the multidimensional target parameter information includes the mel-frequency cepstrum coefficient MFCC and the speaker-related feature Ivector.
Optionally, the process of performing noise robustness processing on the first phonetic feature is as follows: the first phonetic feature is processed by the convolutional neural network CNN to obtain a second phonetic feature.
Optionally, the convolutional neural network CNN is a residual network comprising stacked convolutional layers, every two convolutional layers forming a unit, with a down-sampling layer arranged between two units.
Optionally, skip connections are used between every two convolutional layers.
Optionally, the process of performing voice context processing on the second phonetic feature is as follows: the timestamp information in the second phonetic feature is extracted by the feedforward sequential memory network pFSMN to obtain the content contained in the voice context.
Optionally, the pFSMN includes multiple layers, and the memory modules (Memory Blocks) corresponding to different layers differ in size.
Optionally, as the layer level increases from low to high, the corresponding Memory Block also grows from small to large.
Optionally, the lower levels of the pFSMN extract phoneme information in the phonetic feature, and the higher levels of the pFSMN extract semantic and grammatical features in the phonetic feature.
Optionally, the pFSMN includes stacked feedforward sequential memory layers; wherein each feedforward sequential memory layer consists of a memory module and a mapping layer.
Optionally, skip connections are used between every two feedforward sequential memory layers.
Optionally, the memory module is used to compute a weighted sum over the voice context, and the mapping layer is used to map the high dimensions down to a smaller bottleneck dimension.
Optionally, the training method includes the maximum mutual information training criterion LF-MMI and cross entropy CE multitask training.
Optionally, the multidimensional processing result is trained by way of the maximum mutual information training criterion LF-MMI and cross entropy CE multitask training.
In order to achieve the above objects and other related objects, the present invention also provides a parameter management device, the parameter management device comprising:
a feature extracting module, for obtaining voice data and extracting a first phonetic feature from the voice data;
a phonetic feature processing module, for performing multidimensional processing on the first phonetic feature and outputting a multidimensional processing result; wherein the multidimensional processing includes at least one of: noise robustness processing, voice context processing;
a training module, for training the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information.
Optionally, the multidimensional target parameter information includes the mel-frequency cepstrum coefficient MFCC and the speaker-related feature Ivector.
Optionally, the process of performing noise robustness processing on the first phonetic feature is as follows: the first phonetic feature is processed by the convolutional neural network CNN to obtain a second phonetic feature.
Optionally, the convolutional neural network CNN is a residual network comprising stacked convolutional layers, every two convolutional layers forming a unit, with a down-sampling layer arranged between two units.
Optionally, skip connections are used between every two convolutional layers.
Optionally, the process of performing voice context processing on the second phonetic feature is as follows: the timestamp information in the second phonetic feature is extracted by the feedforward sequential memory network pFSMN to obtain the content contained in the voice context.
Optionally, the pFSMN includes multiple layers, and the memory modules (Memory Blocks) corresponding to different layers differ in size.
Optionally, as the layer level increases from low to high, the corresponding Memory Block also grows from small to large.
Optionally, the lower levels of the pFSMN extract phoneme information in the phonetic feature, and the higher levels of the pFSMN extract semantic and grammatical features in the phonetic feature.
Optionally, the pFSMN includes stacked feedforward sequential memory layers; wherein each feedforward sequential memory layer consists of a memory module and a mapping layer.
Optionally, skip connections are used between every two feedforward sequential memory layers.
Optionally, the memory module is used to compute a weighted sum over the voice context, and the mapping layer is used to map the high dimensions down to a smaller bottleneck dimension.
Optionally, the training method includes the maximum mutual information training criterion LF-MMI and cross entropy CE multitask training.
Optionally, the multidimensional processing result is trained by way of the maximum mutual information training criterion LF-MMI and cross entropy CE multitask training.
In order to achieve the above objects and other related objects, the present invention also provides an equipment, comprising:
one or more processors; and
one or more machine readable media having instructions stored thereon which, when executed by the one or more processors, cause the equipment to execute the method described in one or more of the above.
In order to achieve the above objects and other related objects, the present invention also provides one or more machine readable media having instructions stored thereon which, when executed by one or more processors, cause an equipment to execute the method described in one or more of the above.
As described above, the parameter management method and device of the invention have the following beneficial effects:
1. High recognition rate. The present invention trains data using CNN-pyramidal-FSMN, which can learn additional information that other networks cannot learn, and can achieve good results even in noisy and other complex environments.
2. Fast training. By avoiding the loop structure of LSTM, pFSMN can better exploit parallel computation on GPU, while still capturing the long-term dependency information provided by LSTM.
Detailed description of the invention
Fig. 1 is a kind of flow chart of parameter management method in one embodiment of the invention;
Fig. 2 is the network architecture schematic diagram of CNN-pFSMN in one embodiment of the invention;
Fig. 3 is the block diagram of memory module in one embodiment of the invention;
Fig. 4 is the hardware structural diagram of terminal device in one embodiment of the invention;
Fig. 5 is the hardware structural diagram of terminal device in one embodiment of the invention.
Specific embodiment
The embodiments of the present invention are illustrated below by specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other.
It should be noted that the drawings provided in the following embodiments only illustrate the basic concept of the invention in a schematic way; the drawings show only the components related to the invention rather than the actual number, shape and size of the components in implementation; in actual implementation, the form, quantity and proportion of each component can change arbitrarily, and the component layout may be more complicated.
As shown in Figure 1, the present embodiment provides a parameter management method, the parameter management method comprising:
S1, obtaining voice data and extracting a first phonetic feature from the voice data;
S2, performing multidimensional processing on the first phonetic feature and outputting a multidimensional processing result; wherein the multidimensional processing includes at least one of: noise robustness processing, voice context processing;
S3, training the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information.
In the actual process, the multidimensional target parameter information includes the mel-frequency cepstrum coefficient MFCC and the speaker-related feature Ivector.
Fig. 2 shows a neural network architecture used to implement the above management method. As shown in Fig. 2, it includes: the mel-frequency cepstrum coefficient MFCC; the speaker-related feature Ivector; multiple convolutional layers Conv; a down-sampling layer subsample; multiple feedforward sequential memory network pFSMN layers; a fully connected Linear layer; an LF-MMI output; and a CE output. The output of the convolutional layers serves as the input of the feedforward sequential memory network pFSMN layers, the processing result of the pFSMN layers is input to the fully connected layer, training is performed by way of the maximum mutual information training criterion LF-MMI and cross entropy CE multitask training, and the training result is output through the LF-MMI output and the CE output.
Further, Conv includes Conv1, Conv2, Conv3, Conv4, Conv5 and Conv6; wherein Conv1 and Conv2 form unit A, Conv3 and Conv4 form unit B, and Conv5 and Conv6 form unit C; there is one subsample layer between A and B, and one subsample layer between B and C.
Further, the pFSMN layers include pFSMN layers 1 through 10.
Further, the neural network architecture is a CNN-pFSMN model. Its input consists of 40-dimensional high-resolution MFCC + 3-dimensional pitch + 100-dimensional Ivector. The 43-dimensional MFCC+pitch is first converted to a 40-dimensional filter bank feature by an inverse DCT (discrete cosine transform) transformation, and the 100-dimensional Ivector is expanded to 200 dimensions by a linear transformation; the 240-dimensional input in total is cut into 40*6 (channel quantity) as the initial input of the CNN (Convolutional Neural Network).
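As a rough illustration, the feature assembly described above can be sketched as follows (a minimal NumPy sketch; the random matrices stand in for the real inverse-DCT and learned linear transforms, and all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frame inputs (random stand-ins for real feature values).
mfcc_pitch = rng.standard_normal(43)   # 40-dim high-resolution MFCC + 3-dim pitch
ivector = rng.standard_normal(100)     # 100-dim speaker Ivector

# Stand-in transforms: the text specifies an inverse DCT to a 40-dim filter
# bank feature and a linear map expanding the Ivector to 200 dims.
idct = rng.standard_normal((40, 43))
linear = rng.standard_normal((200, 100))

fbank = idct @ mfcc_pitch          # 40-dim filter bank feature
ivec_expanded = linear @ ivector   # 200-dim expanded Ivector

frame = np.concatenate([fbank, ivec_expanded])  # 240 dims in total
cnn_input = frame.reshape(6, 40)   # cut into 6 channels of 40, the CNN's initial input

assert frame.shape == (240,)
assert cnn_input.shape == (6, 40)
```

The dimension arithmetic (40 + 200 = 240 = 6 × 40) is the only part taken directly from the text; everything else is illustrative.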
The LF-MMI training criterion, i.e. the lattice-free maximum mutual information criterion, computes all possible annotation sequences at the neural network output layer, calculates the corresponding MMI information and the relevant gradients from these annotation sequences, and then completes training by gradient propagation. The LF-MMI training criterion can directly calculate the posterior probability of all possible paths during training, eliminating the need to generate word lattices in advance before discriminative training.
LF-MMI aims to maximize the mutual information between the word sequence distribution and the observation sequence distribution. Assume the observation sequence is $o_m = (o_{m1}, \dots, o_{mT_m})$ and the word sequence is $w_m = (w_{m1}, \dots, w_{mN_m})$, where $m$ indexes the utterance, $T_m$ denotes the number of frames and $N_m$ the number of words. The training set is $S = \{(o_m, w_m) \mid 0 \le m \le M\}$, and the LF-MMI training criterion can be expressed as follows:

$$\mathcal{F}_{\mathrm{MMI}}(\theta)=\sum_{m}\log\frac{p_{\theta}(o_{m}\mid s_{m})^{k}\,P(w_{m})}{\sum_{w}p_{\theta}(o_{m}\mid s_{w})^{k}\,P(w)}\qquad(1)$$

Here the numerator is the total score of the path corresponding to the correct result, and the denominator is the sum of the scores of all paths; the numerator is obtained from the numerator lattice and the denominator from the denominator lattice. $\theta$ denotes the model parameters; $S$ denotes the training set $S = \{(o_m, w_m) \mid 0 \le m \le M\}$; $o_m$ denotes the observation sequence; $w_m$ denotes the true word sequence of the $m$-th audio; $s_m$ denotes its state sequence; $k$ denotes the acoustic scaling factor; and $w$ ranges over all possible sequences in the denominator lattice. $P(w)$ is the prior probability of a sequence, i.e. its language model score. The difference between $w$ and the target $w_m$ is that $w_m$ is the true word sequence of a given audio and serves as the training label, whereas $w$, being summed over in the denominator, represents all possible sequences in the denominator lattice (which contains the true sequence as well as sequences similar to it). The purpose of this criterion is therefore to separate, as far as possible, the true sequence from wrong but very similar sequences.
In calculation, the results of the numerator and the denominator need to be obtained separately: the numerator result is obtained from the generated numerator lattice, and the denominator result from the generated denominator lattice.
Specifically, the numerator lattice and the denominator lattice can be computed by the following methods:
Denominator lattice generation: on the basis of LF-MMI, the training labels are first converted to phoneme information and a phone-level language model is generated; the training data is then converted to the most likely phoneme sequences. A 4-gram language model is constructed, and then a decoding FST graph based on the phoneme language model, i.e. the HCP graph, is built. When generating the denominator lattice, the generation is moved onto the GPU (Graphics Processing Unit), which greatly reduces the time.
Numerator lattice generation: the label data in the training data is converted to a sequence of phonetic units, and then converted to FST format through the HCP decoding graph, obtaining the possible sequences required for the numerator lattice.
During training, by calculating the occurrence probabilities of the numerator and denominator lattice sequences, the loss value of MMI, namely formula (1), can be obtained.
By modifying the way the denominator and numerator lattices are generated, this embodiment avoids the need for pre-alignment and strict alignment, and directly generates the required lattices during training.
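To make the numerator/denominator bookkeeping concrete, here is a toy log-domain version of the per-utterance objective in formula (1) (the function name and scores are illustrative stand-ins, not real lattice computations):

```python
import math

def mmi_utterance_objective(numerator_logscore, denominator_logscores):
    """Log of the correct path's score over the sum of all path scores in
    the denominator lattice: log(num) - log(sum of denominator scores).
    Inputs are log-domain path scores (in the real model, the acoustic
    score scaled by k plus the language-model prior)."""
    log_den = math.log(sum(math.exp(s) for s in denominator_logscores))
    return numerator_logscore - log_den

# The denominator lattice contains the true path plus similar wrong paths;
# the objective is highest when wrong paths score far below the true one.
close = mmi_utterance_objective(0.0, [0.0, -0.5])   # wrong path nearly as good
far = mmi_utterance_objective(0.0, [0.0, -10.0])    # wrong path clearly worse
assert far > close
assert far <= 0.0  # never positive: the true path is among the summed paths
```

This mirrors the criterion's stated purpose: pushing down the scores of wrong-but-similar sequences raises the objective.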
The CE cross entropy training criterion is a criterion that optimizes a probability distribution, and may be expressed as:

$$\mathcal{F}_{\mathrm{CE}}=-\sum_{i} y_{i}\log \hat{y}_{i}$$

where $y_i$ is the empirical probability that the observed feature $o$ belongs to class $i$ (from the labels of the training data), and $\hat{y}_i$ is the probability estimated using the DNN.
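For a single frame, the CE criterion above reduces to the following (a minimal pure-Python sketch; the function name and toy values are ours):

```python
import math

def cross_entropy(y_true, y_pred):
    """CE between the empirical label distribution (from training-data
    labels) and the DNN-estimated probabilities, using natural log."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# One-hot frame label on class 1; the model assigns it probability 0.8:
loss = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.8, 0.1])
assert abs(loss - (-math.log(0.8))) < 1e-12  # loss = -log(0.8)
```

With one-hot labels the sum collapses to the negative log-probability of the correct class, which is why CE pairs naturally with LF-MMI in multitask training: both push probability mass toward the labeled target.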
Training with the maximum mutual information training criterion LF-MMI and cross entropy CE multitask training trains the neural network more effectively, while also avoiding the overfitting that may arise during training, and finally outputs the multidimensional target parameter information. In the actual process, the multidimensional target parameter information includes the mel-frequency cepstrum coefficient MFCC and the speaker-related feature Ivector.
In some embodiments, the process of performing noise robustness processing on the first phonetic feature is: the first phonetic feature is processed by the convolutional neural network CNN to obtain a second phonetic feature.
Here, the first phonetic feature can be understood as the initial speech feature extracted from the voice data, which contains both speech and noise; the convolutional neural network CNN performs noise robustness processing on the first phonetic feature (reducing or eliminating the noise) to obtain the second phonetic feature. The second phonetic feature here refers to the first phonetic feature after the noise has been reduced or eliminated.
The convolutional neural network CNN is a residual network comprising stacked convolutional layers, every two convolutional layers forming a unit with a down-sampling layer between two units. Specifically, the convolutional neural network CNN includes 6 stacked convolutional layers, with 3*3 convolution kernels and channel counts of 6 (in)-64-64-128-128-256-256 (out). With the down-sampling layers, the feature dimensions become 40 (in)-40-20-20-10-10 (out).
In some embodiments, skip connections are used between every two convolutional layers. Using skip connections makes gradient propagation more reasonable.
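The dimension bookkeeping implied by this description can be sketched as follows (pure Python, not the actual network; the layer counts and halving behaviour are taken from the text, the function name is ours):

```python
def conv_feature_dims(start_dim=40, n_units=3, convs_per_unit=2):
    """Per-conv-layer feature dimensions for the residual CNN above:
    3*3 convolutions preserve the feature dimension, and the down-sampling
    layer between consecutive units halves it."""
    dims, d = [], start_dim
    for unit in range(n_units):
        dims += [d] * convs_per_unit      # two convs per unit (A, B, C)
        if unit < n_units - 1:            # subsample between A-B and B-C
            d //= 2
    return dims

# Matches the stated progression 40 (in) -40-20-20-10-10 (out):
assert conv_feature_dims(40) == [40, 40, 20, 20, 10, 10]
```

The channel counts (6-64-64-128-128-256-256) follow the same unit structure; only the feature dimension is affected by down-sampling.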
In some embodiments, the process of performing voice context processing on the second phonetic feature is: the timestamp information in the second phonetic feature is extracted by the feedforward sequential memory network pFSMN (pyramidal feedforward sequential memory network) to obtain the content contained in the voice context.
Further, as shown in Fig. 2, the pFSMN includes multiple layers, where the memory modules (Memory Blocks) corresponding to different layers differ in size; as the layer level increases from low to high, the corresponding Memory Block also grows from small to large.
In some embodiments, the lower levels of the pFSMN extract phoneme information in the phonetic feature, while the higher levels of the pFSMN extract semantic and grammatical features in the phonetic feature.
In some embodiments, the pFSMN includes stacked feedforward sequential memory layers, with skip connections between every two feedforward sequential memory layers. Each feedforward sequential memory layer consists of a memory module and a mapping layer; the memory module computes a weighted sum over the voice context, and the mapping layer maps the high dimensions down to a smaller bottleneck dimension, allowing the features to be represented more compactly.
Fig. 3 is the memory module memory block building-block of logic in feedforward sequence memory network pFSMN, is used for reference
The concept of FIR filter and filter IIR in DSP (Digital Signal Processor, digital signal processor), i.e., without
Limit impact response filter is equivalent to finite impulse response filter.It is exactly RNN (Recurrent in the concept of deep learning
Neural Network, Recognition with Recurrent Neural Network) loop structure can be replaced by a kind of acyclic structure, while can also obtain
Take prolonged contextual information.
The core module of pFSMN exploits the information of the preceding and following frames: a block sum over the context is computed and passed into the next Memory Block (memory module), so that after several pFSMN layers, long-span contextual information is accumulated. A conventional FSMN, however, uses context inefficiently: every layer uses the same context window, e.g. 20 frames before and after. If the memory of a lower-layer Memory Block were appended directly into an upper-layer Memory Block, the upper and lower layers would hold identical memories, which is highly redundant. In the network architecture of this embodiment there are ten pFSMN layers in total; the Memory Blocks of the lower layers are smaller and those of higher layers become progressively larger, so that the lower layers extract phoneme information while the higher layers extract semantic and grammatical properties. When skip connections are used to join the Memory Blocks of lower and upper layers, a connection is made only when the Memory Block sizes differ. In addition, each pFSMN layer applies batch normalization, ReLU (Rectified Linear Unit) activation and l2-regularization.
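The pyramidal scheme described above can be sketched as follows. The context orders, the uniform averaging weights, and the rule for placing skip connections are illustrative assumptions, not the trained values of the embodiment:

```python
import numpy as np

def memory_block(frames, t, order):
    """FIR-style memory: combine `order` past and future frames around
    time t (uniform weights here; the real network learns them)."""
    lo, hi = max(0, t - order), min(len(frames), t + order + 1)
    return frames[lo:hi].mean(axis=0)

# Pyramidal scheme: lower layers use small context windows (phoneme scale),
# higher layers use progressively larger ones (semantic/grammatical scale).
# These ten orders are hypothetical sizes for the ten pFSMN layers.
orders = [2, 2, 4, 4, 8, 8, 12, 12, 16, 16]

# Skip connections join a lower and an upper layer only when their
# Memory Block sizes differ, so identical memories are never duplicated.
skips = [(i, j)
         for i in range(len(orders))
         for j in range(i + 1, len(orders))
         if orders[i] != orders[j]]
```

Layers 0 and 1 share the same order, so `(0, 1)` gets no skip connection, while `(0, 2)` does; this mirrors the redundancy argument in the text.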
In some embodiments, the training method includes multitask training with the maximum mutual information training criterion LF-MMI and cross entropy CE.
In some embodiments, the multidimensional processing result is trained in the manner of LF-MMI and cross-entropy CE multitask training.
In some embodiments, a parameter management apparatus is also provided. The parameter management apparatus includes:
a feature extraction module, configured to obtain voice data and extract a first phonetic feature from the voice data;
a phonetic feature processing module, configured to perform multidimensional processing on the first phonetic feature and output a multidimensional processing result, wherein the multidimensional processing includes at least one of the following: noise-robustness processing, voice-context processing;
a training module, configured to train the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information. In practice, the multidimensional target parameter information includes mel-frequency cepstral coefficients (MFCC) and the speaker-related feature i-vector.
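As a rough illustration of the MFCC component of the target parameter information, the following is a minimal MFCC computation in plain NumPy (frame → power spectrum → mel filterbank → log → DCT). The window, hop, and filterbank sizes are common defaults, not values specified by the embodiment, and production toolkits add refinements (dithering, liftering, delta features) omitted here:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=23, n_ceps=13):
    """Minimal MFCC sketch; parameter values are typical defaults."""
    # Frame the signal and apply a Hann window.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        fb[i, bins[i]:bins[i + 1]] = np.linspace(
            0, 1, bins[i + 1] - bins[i], endpoint=False)
        fb[i, bins[i + 1]:bins[i + 2]] = np.linspace(
            1, 0, bins[i + 2] - bins[i + 1], endpoint=False)
    logmel = np.log(spec @ fb.T + 1e-10)
    # DCT-II decorrelates the log-mel energies into cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T
```

The i-vector extraction also mentioned in the text requires a trained universal background model and total-variability matrix, so it is not reproduced here.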
In some embodiments, the process of performing noise-robustness processing on the first phonetic feature is as follows:
the first phonetic feature is processed by a convolutional neural network (CNN) to obtain a second phonetic feature.
Here, the first phonetic feature can be understood as the initial speech feature extracted from the voice data, which contains both speech and noise. The CNN performs noise-robustness processing on the first phonetic feature (reducing or eliminating the noise) to obtain the second phonetic feature, i.e. the first phonetic feature after the noise has been reduced or eliminated.
In some embodiments, the convolutional neural network CNN is a residual network comprising multiple stacked convolutional layers, with a skip connection across every two convolutional layers; using skip connections makes gradient propagation better behaved. Every two convolutional layers form a unit, and a down-sampling layer is arranged between two units.
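A minimal sketch of the residual unit just described, reduced to single-channel 1-D convolutions for brevity. The kernel weights are placeholders; a real front-end uses learned multi-channel 2-D filters:

```python
import numpy as np

def conv1d(x, w):
    """'Same'-length 1-D convolution with edge padding (odd kernel)."""
    pad = len(w) // 2
    return np.convolve(np.pad(x, pad, mode="edge"), w, mode="valid")

def res_unit(x, w1, w2):
    """Two convolutional layers with an identity skip connection, as in a
    residual network; the shortcut keeps gradient propagation well behaved."""
    h = np.maximum(conv1d(x, w1), 0)   # conv + ReLU
    h = conv1d(h, w2)                  # second conv
    return np.maximum(h + x, 0)        # skip connection, then ReLU

def downsample(x):
    """Stride-2 down-sampling layer placed between two-conv-layer units."""
    return x[::2]
```

Stacking `res_unit` and `downsample` alternately reproduces the "two convolutional layers per unit, down-sampling between units" layout of the embodiment.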
In some embodiments, the process of performing voice-context processing on the second phonetic feature is as follows:
the temporal information in the second phonetic feature is extracted by the feedforward sequential memory network pFSMN to obtain the content contained in the voice context.
In some embodiments, the pFSMN includes multiple layers, and the memory modules (Memory Blocks) corresponding to different layers differ in size.
In some embodiments, as the layers go from low to high, the corresponding Memory Blocks likewise go from small to large.
In some embodiments, the lower layers in the pFSMN extract the phoneme information in the phonetic feature, and the higher layers in the pFSMN extract the semantic and grammatical properties in the phonetic feature.
In some embodiments, the pFSMN includes multiple stacked forward sequential memory layers, with skip connections between every two forward sequential memory layers. Each forward sequential memory layer is composed of a memory module and a mapping layer. The memory module performs a weighted correlation-style summation over the speech context, and the mapping layer projects the high-dimensional representation into a smaller bottleneck dimension.
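A single forward sequential memory layer as just described might be sketched as follows; the context weights, context order, and bottleneck size are illustrative assumptions, not the trained parameters of the embodiment:

```python
import numpy as np

def pfsmn_layer(h, ctx_w, proj):
    """One forward sequential memory layer (sketch).

    The memory module takes a weighted sum over neighbouring frames
    (ctx_w holds 2*order+1 context weights); the mapping layer then
    projects the result into a smaller bottleneck dimension via `proj`."""
    T, D = h.shape
    order = len(ctx_w) // 2
    padded = np.pad(h, ((order, order), (0, 0)), mode="edge")
    mem = np.zeros_like(h)
    for k, w in enumerate(ctx_w):      # weighted correlation-style sum
        mem += w * padded[k:k + T]
    return mem @ proj                  # bottleneck projection (D -> d)
```

In the described architecture ten such layers would be stacked, with batch normalization, ReLU, and l2-regularization applied per layer (omitted here for brevity).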
In some embodiments, the training method includes multitask training with the maximum mutual information training criterion LF-MMI and cross entropy CE.
In some embodiments, the multidimensional processing result is trained in the manner of LF-MMI and cross-entropy CE multitask training. Training with the LF-MMI criterion and CE as a multitask objective makes the neural network training more effective while also avoiding the over-fitting that may arise during training.
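The combined objective can be sketched as a weighted sum of the two tasks. The LF-MMI term requires a denominator-graph forward-backward computation that is not reproduced here, so it appears only as a placeholder value, and the interpolation weight `alpha` is an assumption rather than a value from the embodiment:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Frame-level cross-entropy (CE) over per-frame target labels."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def multitask_loss(logits, labels, lfmmi_loss, alpha=0.1):
    """Multitask objective: sequence-level LF-MMI term plus a CE term.

    `lfmmi_loss` stands in for the lattice-free MMI computation; the CE
    term acts as a regularizer that helps curb over-fitting."""
    return lfmmi_loss + alpha * cross_entropy(logits, labels)
```

In practice the two losses are computed from separate output heads of the same network and back-propagated jointly.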
It should be noted that since the embodiments of the apparatus correspond to the embodiments of the method, the details of the apparatus embodiments can be found in the description of the method embodiments and are not repeated here.
An embodiment of the present application further provides a device, which may include one or more processors and one or more machine-readable media storing instructions that, when executed by the one or more processors, cause the device to perform the method described in Fig. 1. In practical applications, the device may serve as a terminal device or as a server. Examples of terminal devices may include smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, in-vehicle computers, desktop computers, set-top boxes, smart televisions, wearable devices and the like; the embodiments of the present application do not restrict the specific device.
An embodiment of the present application further provides a non-volatile readable storage medium storing one or more modules (programs) which, when applied in a device, cause the device to execute the instructions of the steps included in the parameter management method of Fig. 1 of the embodiments of the present application.
Fig. 4 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application. As shown, the terminal device may include an input device 1100, a processor 1101, an output device 1102, a memory 1103 and at least one communication bus 1104. The communication bus 1104 implements the communication connections between the elements. The memory 1103 may include a high-speed RAM memory and may also include non-volatile memory (NVM), for example at least one disk memory; various programs may be stored in the memory 1103 for completing various processing functions and implementing the method steps of this embodiment.
Optionally, the processor 1101 may be implemented as, for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor or another electronic component, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, for example at least one of a user-oriented user interface, a device-oriented device interface, a programmable software interface, a camera and a sensor. Optionally, the device-oriented device interface may be a wired interface for data transmission between devices, or a hardware plug-in interface for data transmission between devices (e.g. a USB interface, a serial port, etc.). Optionally, the user-oriented user interface may be, for example, user-facing control buttons, a voice input device for receiving voice input, and a touch sensing device for receiving a user's touch input (e.g. a touch screen or touch pad with touch sensing function). Optionally, the programmable software interface may be, for example, an entry point through which a user edits or modifies a program, such as an input pin interface or input interface of a chip. The output device 1102 may include output devices such as a display and a loudspeaker.
In this embodiment, the processor of the terminal device includes the functions for executing the modules of the parameter management apparatus in each device; for the specific functions and technical effects, refer to the above embodiments, which are not repeated here.
Fig. 5 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application; Fig. 5 is a specific embodiment of the implementation of Fig. 4. As shown, the terminal device of this embodiment may include a processor 1201 and a memory 1202.
The processor 1201 executes the computer program code stored in the memory 1202 to implement the parameter management method of Fig. 1 in the above embodiments.
The memory 1202 is configured to store various types of data to support operation on the terminal device. Examples of such data include the instructions of any application or method operated on the terminal device, such as messages, pictures, videos, etc. The memory 1202 may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory.
Optionally, the processor 1201 is arranged in a processing component 1200. The terminal device may further include a communication component 1203, a power supply component 1204, a multimedia component 1205, an audio component 1206, an input/output interface 1207 and a sensor component 1208. The components specifically included in the terminal device are set according to actual demand, which is not limited by this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing component 1200 may include one or more processors 1201 to execute instructions so as to complete all or part of the steps of the method shown in Fig. 1. In addition, the processing component 1200 may include one or more modules to facilitate interaction between the processing component 1200 and other components; for example, the processing component 1200 may include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides electric power for the various components of the terminal device. The power supply component 1204 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing electric power for the terminal device.
The multimedia component 1205 includes a display screen providing an output interface between the terminal device and the user. In some embodiments, the display screen may include a liquid crystal display (LCD) and a touch panel (TP). If the display screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensor can not only sense the boundary of a touch or slide action, but can also detect the duration and pressure associated with the touch or slide operation.
The audio component 1206 is configured to output and input audio signals. For example, the audio component 1206 includes a microphone (MIC); when the terminal device is in an operation mode, such as a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 1202 or sent via the communication component 1203. In some embodiments, the audio component 1206 also includes a loudspeaker for outputting audio signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules; the peripheral interface modules may be click wheels, buttons, etc. These buttons may include, but are not limited to, a volume button, a start button and a lock button.
The sensor component 1208 includes one or more sensors for providing state assessments of various aspects of the terminal device. For example, the sensor component 1208 can detect the on/off state of the terminal device, the relative positioning of components, and the presence or absence of contact between the user and the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera, etc.
The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log into a GPRS network and establish communication with a server through the Internet.
It can be seen from the above that the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 involved in the embodiment of Fig. 5 can serve as implementations of the input device in the embodiment of Fig. 4.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the present invention. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed in the present invention shall be covered by the claims of the present invention.
Claims (30)
1. A parameter management method, characterized in that the parameter management method comprises:
obtaining voice data and extracting a first phonetic feature from the voice data;
performing multidimensional processing on the first phonetic feature and outputting a multidimensional processing result, wherein the multidimensional processing comprises at least one of the following: noise-robustness processing, voice-context processing;
training the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information.
2. The parameter management method according to claim 1, characterized in that the multidimensional target parameter information comprises mel-frequency cepstral coefficients MFCC and the speaker-related feature i-vector.
3. The parameter management method according to claim 1, characterized in that the process of performing noise-robustness processing on the first phonetic feature is as follows:
processing the first phonetic feature by a convolutional neural network CNN to obtain a second phonetic feature.
4. The parameter management method according to claim 3, characterized in that the convolutional neural network CNN is a residual network comprising multiple stacked convolutional layers, every two convolutional layers forming a unit, with a down-sampling layer arranged between two units.
5. The parameter management method according to claim 4, characterized in that a skip connection is used across every two convolutional layers.
6. The parameter management method according to claim 3, characterized in that the process of performing voice-context processing on the second phonetic feature is as follows:
extracting the temporal information in the second phonetic feature by a feedforward sequential memory network pFSMN to obtain the content contained in the voice context.
7. The parameter management method according to claim 6, characterized in that the pFSMN comprises multiple layers, and the memory modules (Memory Blocks) corresponding to different layers differ in size.
8. The parameter management method according to claim 7, characterized in that, as the layers go from low to high, the corresponding Memory Blocks likewise go from small to large.
9. The parameter management method according to claim 8, characterized in that the lower layers in the pFSMN extract the phoneme information in the phonetic feature, and the higher layers in the pFSMN extract the semantic and grammatical properties in the phonetic feature.
10. The parameter management method according to claim 6, characterized in that the pFSMN comprises multiple stacked forward sequential memory layers, wherein each forward sequential memory layer is composed of a memory module and a mapping layer.
11. The parameter management method according to claim 10, characterized in that a skip connection is used between every two forward sequential memory layers.
12. The parameter management method according to claim 10, characterized in that the memory module is configured to perform a weighted correlation-style summation over the voice context, and the mapping layer is configured to project the high-dimensional representation into a smaller bottleneck dimension.
13. The parameter management method according to claim 1, characterized in that the training method comprises multitask training with the maximum mutual information training criterion LF-MMI and cross entropy CE.
14. The parameter management method according to claim 13, characterized in that the multidimensional processing result is trained in the manner of LF-MMI and cross-entropy CE multitask training.
15. A parameter management apparatus, characterized in that the parameter management apparatus comprises:
a feature extraction module, configured to obtain voice data and extract a first phonetic feature from the voice data;
a phonetic feature processing module, configured to perform multidimensional processing on the first phonetic feature and output a multidimensional processing result, wherein the multidimensional processing comprises at least one of the following: noise-robustness processing, voice-context processing;
a training module, configured to train the multidimensional processing result using a corresponding training method to obtain multidimensional target parameter information.
16. The parameter management apparatus according to claim 15, characterized in that the multidimensional target parameter information comprises mel-frequency cepstral coefficients MFCC and the speaker-related feature i-vector.
17. The parameter management apparatus according to claim 15, characterized in that the process of performing noise-robustness processing on the first phonetic feature is as follows:
processing the first phonetic feature by a convolutional neural network CNN to obtain a second phonetic feature.
18. The parameter management apparatus according to claim 17, characterized in that the convolutional neural network CNN is a residual network comprising multiple stacked convolutional layers, every two convolutional layers forming a unit, with a down-sampling layer arranged between two units.
19. The parameter management apparatus according to claim 18, characterized in that a skip connection is used across every two convolutional layers.
20. The parameter management apparatus according to claim 17, characterized in that the process of performing voice-context processing on the second phonetic feature is as follows:
extracting the temporal information in the second phonetic feature by a feedforward sequential memory network pFSMN to obtain the content contained in the voice context.
21. The parameter management apparatus according to claim 20, characterized in that the pFSMN comprises multiple layers, and the memory modules (Memory Blocks) corresponding to different layers differ in size.
22. The parameter management apparatus according to claim 21, characterized in that, as the layers go from low to high, the corresponding Memory Blocks likewise go from small to large.
23. The parameter management apparatus according to claim 22, characterized in that the lower layers in the pFSMN extract the phoneme information in the phonetic feature, and the higher layers in the pFSMN extract the semantic and grammatical properties in the phonetic feature.
24. The parameter management apparatus according to claim 20, characterized in that the pFSMN comprises multiple stacked forward sequential memory layers, wherein each forward sequential memory layer is composed of a memory module and a mapping layer.
25. The parameter management apparatus according to claim 24, characterized in that a skip connection is used between every two forward sequential memory layers.
26. The parameter management apparatus according to claim 24, characterized in that the memory module is configured to perform a weighted correlation-style summation over the voice context, and the mapping layer is configured to project the high-dimensional representation into a smaller bottleneck dimension.
27. The parameter management apparatus according to claim 15, characterized in that the training method comprises multitask training with the maximum mutual information training criterion LF-MMI and cross entropy CE.
28. The parameter management apparatus according to claim 27, characterized in that the multidimensional processing result is trained in the manner of LF-MMI and cross-entropy CE multitask training.
29. A device, characterized by comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, cause the device to perform the method according to one or more of claims 1-14.
30. One or more machine-readable media having instructions stored thereon which, when executed by one or more processors, cause a device to perform the method according to one or more of claims 1-14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910600722.3A CN110335591A (en) | 2019-07-04 | 2019-07-04 | A kind of parameter management method, device, machine readable media and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910600722.3A CN110335591A (en) | 2019-07-04 | 2019-07-04 | A kind of parameter management method, device, machine readable media and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110335591A true CN110335591A (en) | 2019-10-15 |
Family
ID=68143160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910600722.3A Pending CN110335591A (en) | 2019-07-04 | 2019-07-04 | A kind of parameter management method, device, machine readable media and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110335591A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144511A (en) * | 2019-12-31 | 2020-05-12 | 上海云从汇临人工智能科技有限公司 | Image processing method, system, medium and electronic terminal based on neural network |
CN112259080A (en) * | 2020-10-20 | 2021-01-22 | 成都明杰科技有限公司 | Speech recognition method based on neural network model |
CN112581942A (en) * | 2020-12-29 | 2021-03-30 | 云从科技集团股份有限公司 | Method, system, device and medium for recognizing target object based on voice |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
CN105869630A (en) * | 2016-06-27 | 2016-08-17 | 上海交通大学 | Method and system for detecting voice spoofing attack of speakers on basis of deep learning |
CN106919977A (en) * | 2015-12-25 | 2017-07-04 | 科大讯飞股份有限公司 | A kind of feedforward sequence Memory Neural Networks and its construction method and system |
CN108806668A (en) * | 2018-06-08 | 2018-11-13 | 国家计算机网络与信息安全管理中心 | A kind of audio and video various dimensions mark and model optimization method |
CN109767759A (en) * | 2019-02-14 | 2019-05-17 | 重庆邮电大学 | End-to-end speech recognition methods based on modified CLDNN structure |
- 2019-07-04: CN application CN201910600722.3A filed; published as CN110335591A; status Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
CN106919977A (en) * | 2015-12-25 | 2017-07-04 | 科大讯飞股份有限公司 | A kind of feedforward sequence Memory Neural Networks and its construction method and system |
CN105869630A (en) * | 2016-06-27 | 2016-08-17 | 上海交通大学 | Method and system for detecting voice spoofing attack of speakers on basis of deep learning |
CN108806668A (en) * | 2018-06-08 | 2018-11-13 | 国家计算机网络与信息安全管理中心 | A kind of audio and video various dimensions mark and model optimization method |
CN109767759A (en) * | 2019-02-14 | 2019-05-17 | 重庆邮电大学 | End-to-end speech recognition methods based on modified CLDNN structure |
Non-Patent Citations (3)
Title |
---|
XUERUI YANG ET AL.: "A novel pyramidal-FSMN architecture with lattice-free MMI for speech recognition", 《ARXIV: SOUND》 *
机器之心 (Synced): "Word error rate 2.97%: CloudWalk Technology sets a new world record in speech recognition", 《链闻 (ChainNews)》 *
风雨兼程: "Word error rate 2.97%: the speech recognition world record is refreshed again", 《知乎 (Zhihu)》 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144511A (en) * | 2019-12-31 | 2020-05-12 | 上海云从汇临人工智能科技有限公司 | Image processing method, system, medium and electronic terminal based on neural network |
CN111144511B (en) * | 2019-12-31 | 2020-10-20 | 上海云从汇临人工智能科技有限公司 | Image processing method, system, medium and electronic terminal based on neural network |
CN112259080A (en) * | 2020-10-20 | 2021-01-22 | 成都明杰科技有限公司 | Speech recognition method based on neural network model |
CN112581942A (en) * | 2020-12-29 | 2021-03-30 | 云从科技集团股份有限公司 | Method, system, device and medium for recognizing target object based on voice |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11620983B2 (en) | Speech recognition method, device, and computer-readable storage medium | |
WO2021135577A9 (en) | Audio signal processing method and apparatus, electronic device, and storage medium | |
CN108352168A (en) | The low-resource key phrase detection waken up for voice | |
CN110288980A (en) | Audio recognition method, the training method of model, device, equipment and storage medium | |
CN108804536B (en) | Man-machine conversation and strategy generation method, equipment, system and storage medium | |
CN107358951A (en) | A kind of voice awakening method, device and electronic equipment | |
CN110335591A (en) | A kind of parameter management method, device, machine readable media and equipment | |
CN108780646A (en) | Intermediate scoring for the detection of modified key phrase and refusal loopback | |
US11769018B2 (en) | System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system | |
CN109874029A (en) | Video presentation generation method, device, equipment and storage medium | |
CN107221330A (en) | Punctuate adding method and device, the device added for punctuate | |
CN110097870B (en) | Voice processing method, device, equipment and storage medium | |
CN110097890A (en) | A kind of method of speech processing, device and the device for speech processes | |
EP3234946A1 (en) | System and method of automatic speech recognition using parallel processing for weighted finite state transducer-based speech decoding | |
CN104777911A (en) | Intelligent interaction method based on holographic technique | |
US20210375260A1 (en) | Device and method for generating speech animation | |
CN113421547B (en) | Voice processing method and related equipment | |
WO2023222088A1 (en) | Voice recognition and classification method and apparatus | |
CN112200318A (en) | Target detection method, device, machine readable medium and equipment | |
CN112069309A (en) | Information acquisition method and device, computer equipment and storage medium | |
WO2023207541A1 (en) | Speech processing method and related device | |
CN110379411A (en) | For the phoneme synthesizing method and device of target speaker | |
CN109767758A (en) | Vehicle-mounted voice analysis method, system, storage medium and equipment | |
CN114996422A (en) | Instruction recognition method and device, training method and computer readable storage medium | |
Luo et al. | Multi-quartznet: Multi-resolution convolution for speech recognition with multi-layer feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 511457 Guangdong city of Guangzhou province Nansha District Golden Road No. 26 room 1306 (only for office use) Applicant after: Yuncong Technology Group Co., Ltd Address before: 511457 Guangdong city of Guangzhou province Nansha District Golden Road No. 26 room 1306 (only for office use) Applicant before: GUANGZHOU YUNCONG INFORMATION TECHNOLOGY CO., LTD. |
|
CB02 | Change of applicant information | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191015 |
|
RJ01 | Rejection of invention patent application after publication |