CN107871499A - Speech recognition method, system, computer device and computer-readable storage medium - Google Patents
Speech recognition method, system, computer device and computer-readable storage medium
- Publication number
- CN107871499A CN107871499A CN201711031665.9A CN201711031665A CN107871499A CN 107871499 A CN107871499 A CN 107871499A CN 201711031665 A CN201711031665 A CN 201711031665A CN 107871499 A CN107871499 A CN 107871499A
- Authority
- CN
- China
- Prior art keywords
- vocabulary set
- word
- search network
- score
- decoding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/083—Recognition networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Abstract
The present application relates to a speech recognition method, system, computer device and storage medium. The method includes: inputting a speech-signal feature sequence into a monophone search network and an in-vocabulary word search network, and decoding the two networks synchronously; obtaining the in-vocabulary word output-state score produced by the in-vocabulary word search network; when the in-vocabulary word output-state score meets a preset condition, obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network; and selecting the corresponding decoding path according to the confidence and outputting the speech recognition result. In the above speech recognition method, system, computer device and computer-readable storage medium, feeding the speech-signal feature sequence into both the monophone search network and the in-vocabulary word search network simultaneously for decoding and propagation effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
Description
Technical field
The present application relates to the technical field of speech recognition, and in particular to a speech recognition method, system, computer device and computer-readable storage medium.
Background
With the rapid development and application of computer technology, speech interaction with machines has become an important direction of artificial intelligence and machine learning. Speech recognition technology lets a machine convert a speech signal into corresponding text or commands through recognition and understanding. Current applications of speech recognition fall broadly into two directions: one is large-vocabulary continuous speech recognition systems, applied to phone assistants, voice dictation and the like; the other is products built around small-vocabulary speech recognition, such as intelligent toys and household remote controls.
Small-vocabulary speech recognition systems of the second kind are gradually finding use in fields such as handheld terminals and household appliances. Because they face a small vocabulary, they must cope not only with noise interference, as the first kind does, but also with interference from a large number of out-of-vocabulary words; that is, they must reject out-of-vocabulary words while correctly recognizing in-vocabulary words. The performance of traditional small-vocabulary speech recognition products remains unsatisfactory: they cannot effectively combine in-vocabulary command-word recognition with out-of-vocabulary word rejection, and their recognition accuracy is low.
Summary of the invention
In view of the above problems, it is necessary to provide a speech recognition method, system, computer device and computer-readable storage medium that can effectively recognize in-vocabulary words, reject out-of-vocabulary words and improve recognition accuracy.
A speech recognition method, including:
inputting a speech-signal feature sequence into a monophone search network and an in-vocabulary word search network respectively, and decoding the two networks synchronously;
obtaining the in-vocabulary word output-state score produced by the synchronous decoding;
when the in-vocabulary word output-state score meets a preset condition, obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network;
selecting the corresponding decoding path according to the confidence, and outputting the speech recognition result.
In one embodiment, the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding synchronously includes:
inputting the current frame of the speech-signal feature sequence into the monophone search network to obtain a first output-state score;
when the first output-state score is greater than a first preset threshold, inputting the next frame of the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding synchronously.
In one embodiment, the step of inputting the current frame of the speech-signal feature sequence into the monophone search network to obtain the first output-state score includes:
inputting the current frame of the speech-signal feature sequence into the monophone search network;
obtaining the joint probabilities of the current frame of the speech-signal feature sequence and the primitives of the monophone search network;
taking the maximum of the joint probabilities as the first output-state score.
In one embodiment, the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network when the in-vocabulary word output-state score meets the preset condition includes:
when the in-vocabulary word output-state score meets the preset condition, obtaining a first propagation score of the synchronous decoding of the monophone search network and a second propagation score of the synchronous decoding of the in-vocabulary word search network;
obtaining the confidence from the first propagation score and the second propagation score.
In one embodiment, the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network when the in-vocabulary word output-state score meets the preset condition includes:
when the in-vocabulary word output-state score is greater than a second preset threshold, obtaining the first propagation score and the second propagation score respectively through the network topology;
taking the ratio of the second propagation score to the first propagation score as the confidence.
In one embodiment, the step of selecting the corresponding decoding path according to the confidence and outputting the speech recognition result includes:
obtaining the frame counts of the speech-signal feature sequence corresponding to the confidences that meet a confidence threshold condition;
outputting the decoding path corresponding to the largest frame count to obtain the speech recognition result.
In one embodiment, before the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding synchronously, the method includes:
obtaining a speech signal;
performing endpoint detection on the obtained speech signal to obtain the speech-signal feature sequence.
A speech recognition system, including:
a synchronous decoding module, configured to input a speech-signal feature sequence into a monophone search network and an in-vocabulary word search network respectively and decode them synchronously;
a state-score obtaining module, configured to obtain the in-vocabulary word output-state score produced by the synchronous decoding;
a confidence obtaining module, configured to obtain the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network when the in-vocabulary word output-state score meets a preset condition;
a speech recognition output module, configured to select the corresponding decoding path according to the confidence and output the speech recognition result.
A computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech recognition method described above when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the program implements the speech recognition method described above when executed by a processor.
In the above speech recognition method, system, computer device and computer-readable storage medium, the speech-signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-vocabulary word search network. When the in-vocabulary word output-state score produced by the in-vocabulary word search network meets the preset condition, the confidence of the synchronous decoding of the two networks is obtained, and finally the decoding path corresponding to the confidence is selected and the speech recognition result is output. Feeding the speech-signal feature sequence into both networks simultaneously for decoding and propagation effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an embodiment of the speech recognition method of the present application;
Fig. 2 is a schematic flowchart, in an embodiment of the speech recognition method of the present application, of the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding synchronously;
Fig. 3 is a schematic flowchart, in an embodiment of the speech recognition method of the present application, of the sub-steps of inputting the current frame of the speech-signal feature sequence into the monophone search network to obtain the first output-state score;
Fig. 4 is a schematic flowchart, in an embodiment of the speech recognition method of the present application, of the steps before inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding synchronously;
Fig. 5 is a schematic flowchart of endpoint detection in an embodiment of the speech recognition method of the present application;
Fig. 6 is a schematic structural diagram of an embodiment of the speech recognition system of the present application.
Detailed description of embodiments
To make the objects, technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here only explain the application and do not limit its scope of protection.
Fig. 1 is a schematic flowchart of an embodiment of the speech recognition method of the present application. As shown in Fig. 1, the speech recognition method of this embodiment includes:
Step S101: inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively, and decoding synchronously.
Speech is a kind of sound produced by the human vocal organs; it is an analog signal with a certain grammar and meaning that carries specific information. Because the speech signal is an analog quantity, processing it first requires converting it into a digital signal by sampling and quantization, where the sampling frequency must satisfy the Nyquist sampling theorem, i.e. it must be greater than twice the highest frequency of the speech signal to be sampled. In addition, the speech signal contains much irrelevant information, such as background noise and mood, so speech recognition has come to rely on a large number of speech-signal feature parameters. The basic idea of feature extraction is to remove the redundant part of the preprocessed signal by linear transformation, leaving feature parameters that represent the essence of the speech, and then perform recognition on those parameters. Before feature extraction, the raw speech-signal sequence goes through a series of preprocessing steps in the endpoint-detection module, such as framing, windowing, pre-emphasis and Fourier transformation. The feature parameters of the speech signal include time-domain parameters, such as short-time average energy and pitch period, and frequency-domain parameters, such as the short-time spectrum and the first three formants. The most commonly used speech feature in speech recognition is the mel-frequency cepstral coefficients (MFCC), cepstral parameters extracted in the mel-scale frequency domain; the mel scale describes the nonlinear frequency response of the human ear, and MFCCs can be used to extract the feature sequence of the speech signal.
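The preprocessing steps named above (pre-emphasis, framing, windowing) can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation; the frame length, hop size and pre-emphasis coefficient are conventional example values (25 ms frames, 10 ms hop at 16 kHz), and the full MFCC pipeline (FFT, mel filterbank, DCT) is omitted.

```python
import math

def preemphasize(signal, alpha=0.97):
    """Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len=400, hop=160):
    """Split the signal into overlapping frames (400 samples = 25 ms,
    160-sample hop = 10 ms, at a 16 kHz sampling rate)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Apply a Hamming window to reduce spectral leakage at frame edges."""
    n = len(frame)
    return [frame[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i in range(n)]

# 100 ms of a toy 16 kHz signal -> 8 windowed frames
frames = [hamming(f) for f in frame_signal(preemphasize(
    [float(i % 7) for i in range(1600)]))]
```

An FFT and mel filterbank applied to each windowed frame would then yield the MFCC feature sequence described in the text.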
The monophone search network is a decoding search network whose primitives are all the single phonemes that can form any word; it is used to activate the in-vocabulary word search network and is decoded together with it for recognition and rejection. A phoneme is the smallest speech unit divided according to the natural attributes of speech; one articulatory action forms one phoneme. For example, "ba" contains the two articulatory actions "b" and "a" and is thus two monophones. The actual pronunciation (waveform) of a word, however, depends on many factors beyond the phonemes themselves, such as phoneme context, speaker and speaking style. Taking these factors into account, a monophone can be considered within its context, which yields triphones or polyphones. The in-vocabulary word search network is a decoding search network composed of triphone primitives; the triphones contain the contextual information between phonemes, and the network is used to search the speech-signal feature sequence for in-vocabulary command words. The speech-signal feature sequence extracted from the speech signal is input simultaneously into the monophone search network and the in-vocabulary word search network, and decoding is synchronized. During synchronous decoding, a synchronization signal keeps the feature sequence searching, decoding and propagating state in both networks at the same time. Because the feature sequence is decoded against all monophones and against the in-vocabulary words simultaneously, in-vocabulary word recognition and out-of-vocabulary word rejection are achieved effectively, ensuring recognition accuracy.
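The frame-synchronous decoding of the two networks described above can be sketched as a loop that advances both decoders one frame at a time. This is an illustrative sketch only: `DummyNet` is a hypothetical stand-in for a real search network, and `step` is an assumed interface that propagates tokens for one frame and returns the running score.

```python
def decode_synchronously(frames, monophone_net, invocab_net):
    """Advance both search networks one frame at a time
    (frame-synchronous decoding), yielding both scores per frame."""
    for frame in frames:
        mono_score = monophone_net.step(frame)   # propagate in monophone network
        word_score = invocab_net.step(frame)     # propagate in in-vocabulary network
        yield mono_score, word_score

class DummyNet:
    """Toy stand-in for a search network: accumulates a per-frame match score."""
    def __init__(self, match):
        self.match = match
        self.score = 0.0
    def step(self, frame):
        self.score += self.match(frame)
        return self.score

mono = DummyNet(lambda f: f * 0.5)
invocab = DummyNet(lambda f: f)
results = list(decode_synchronously([1.0, 2.0, 3.0], mono, invocab))
```

A real decoder would propagate token sets through HMM states rather than a single scalar, but the frame-level lockstep between the two networks is the point illustrated here.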
Step S103: obtaining the in-vocabulary word output-state score produced by the synchronous decoding.
During synchronous decoding, the speech-signal feature sequence propagates simultaneously through the monophone search network and the in-vocabulary word search network. While the feature sequence is being searched and decoded in the in-vocabulary word search network, after the state transition of each frame is computed, the in-vocabulary word output-state score of the whole word is calculated. This score is the matching probability of the input feature sequence against the triphone primitives of the in-vocabulary word search network; it characterizes the degree of match between the feature sequence and each primitive in the network. The larger the value, the better the match, i.e. the more likely the feature sequence corresponds to the primitive. More specifically, each primitive of the in-vocabulary word search network can be a hidden Markov model, and the in-vocabulary word output score is then the joint probability of the hidden-state sequence and the corresponding feature sequence computed with the Viterbi algorithm. The Viterbi algorithm is a dynamic-programming algorithm for finding the most likely sequence of hidden states (the Viterbi path) that produces a sequence of observed events; it is particularly applicable to Markov information sources and hidden Markov models. Computing the in-vocabulary word output-state score reveals how well the feature sequence matches during decoding in the in-vocabulary word search network.
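The Viterbi computation mentioned above can be sketched as follows, in log domain to avoid underflow. This is the textbook dynamic program, not the patent's code; the argument layout (`obs_loglik`, `log_trans`, `log_init`) is an illustrative choice.

```python
def viterbi(obs_loglik, log_trans, log_init):
    """Return the log joint probability of the best hidden-state path.

    obs_loglik[t][s]: log P(observation t | state s)
    log_trans[p][s]:  log P(state s | previous state p)
    log_init[s]:      log P(initial state s)
    """
    n_states = len(log_init)
    # delta[s] = best log score of any path ending in state s at time t
    delta = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    for t in range(1, len(obs_loglik)):
        delta = [max(delta[p] + log_trans[p][s] for p in range(n_states))
                 + obs_loglik[t][s]
                 for s in range(n_states)]
    return max(delta)
```

For a trivial one-state model with per-frame log likelihoods -1 and -2, the joint log probability is simply their sum, -3, which is a useful sanity check on the recursion.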
Step S105: when the in-vocabulary word output-state score meets the preset condition, obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network.
When the in-vocabulary word output-state score meets the preset condition, for example when it is greater than a preset threshold, the confidence of the synchronous decoding of the feature sequence in the monophone search network and the in-vocabulary word search network is obtained. The preset condition can be set according to the user's requirements, such as recognition precision: when the condition is set according to precision requirements, it can be a threshold, with a higher threshold for higher precision, and the confidence is obtained when the in-vocabulary word output-state score exceeds the preset threshold. The monophone search network is composed of monophone primitives, while the in-vocabulary word search network is composed of triphone primitives. If the input speech features belong to an in-vocabulary word, they match the in-vocabulary word search network well, and because that network is modeled with triphones (which contain contextual information), its output-state score will be higher than, or very close to, the propagation score obtained by the free monophone search network. When the input feature sequence is an out-of-vocabulary word, the monophone search network still matches it well, since it contains all the monophones that can form any word, but the in-vocabulary word search network, built only from the triphones of the predetermined in-vocabulary words, yields a poor output-state score. This relationship between the two scores is precisely the confidence to be measured. In this embodiment, the confidence of the synchronous decoding of the two networks can specifically be the inverse of the ratio of the monophone search network's propagation score to the in-vocabulary word search network's propagation score; it directly characterizes the decoding results of the two networks and can effectively reject the interference of most out-of-vocabulary words.
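The confidence defined above, the in-vocabulary score divided by the monophone score, can be sketched as below. This is a minimal illustration assuming positive likelihood-style scores; the variable names and the 0.9 example threshold are not from the patent.

```python
def confidence(mono_score, invocab_score):
    """Ratio of the in-vocabulary network's propagation score to the
    monophone network's score (the inverse of mono/in-vocab).
    For an in-vocabulary utterance the triphone network matches about as
    well as the free phone loop, so the ratio approaches 1; for an
    out-of-vocabulary utterance it drops well below 1."""
    return invocab_score / mono_score

def accept(mono_score, invocab_score, threshold=0.9):
    """Accept the hypothesis as in-vocabulary, reject it otherwise."""
    return confidence(mono_score, invocab_score) >= threshold
```

The monophone network thus plays the role of a background (filler) model against which the in-vocabulary hypothesis is normalized.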
Step S107: selecting the corresponding decoding path according to the confidence, and outputting the speech recognition result.
After the confidence of the monophone search network and the in-vocabulary word search network is calculated, the corresponding decoding path is selected according to the confidence to decode the speech-signal feature sequence, and the speech recognition result is output. Further, considering the continuity of the speech signal, the optimal decoding path, i.e. the path with the best decoding match, can be selected according to the confidence and output as the recognition result. More specifically, after the confidences are obtained, the number of frames for which each word meets the confidence threshold can be counted, and the decoding path with the largest frame count is decided to be the optimal decoding path, from which the recognition result is decoded and output.
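The frame-count decision just described can be sketched as a vote over candidate words. This is an illustrative reading of the step, with hypothetical inputs: a mapping from each candidate word to its per-frame confidences.

```python
def pick_best_path(frame_confidences, threshold):
    """frame_confidences: {word: [per-frame confidence values]}.
    Count, for each candidate word, the frames whose confidence meets
    the threshold, and pick the word with the most passing frames."""
    counts = {word: sum(1 for c in confs if c >= threshold)
              for word, confs in frame_confidences.items()}
    return max(counts, key=counts.get)

# 'on' passes the 0.8 threshold on 2 frames, 'off' on only 1
best = pick_best_path({'on': [0.90, 0.95, 0.40],
                       'off': [0.92, 0.30, 0.20]}, 0.8)
```

Counting passing frames rather than comparing a single score exploits the continuity of speech: a genuinely spoken command word stays confident across many consecutive frames.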
In the above speech recognition method, the speech-signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-vocabulary word search network; when the in-vocabulary word output-state score produced by the in-vocabulary word search network meets the preset condition, the confidence of the synchronous decoding of the two networks is obtained; finally the decoding path corresponding to the confidence is selected and the speech recognition result is output. Inputting the feature sequence into both networks simultaneously for decoding and propagation effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
Further, Fig. 2 is a schematic flowchart, in an embodiment of the speech recognition method of the present application, of the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding synchronously. As shown in Fig. 2, step S101 specifically includes:
Step S111: inputting the current frame of the speech-signal feature sequence into the monophone search network to obtain a first output-state score.
The speech signal to be detected is processed by the endpoint-detection module to obtain each frame of the feature sequence to be detected. When a segment of the feature sequence is input into the decoding model, only the monophone search network is activated by default at first: the current frame is input into the monophone search network for propagation and decoding, while the in-vocabulary word search network remains inactive by default. When the feature sequence enters the monophone search network, the first frame activates all monophone models, i.e. all primitive models, in the network, not only the silence phoneme. Each frame of the feature sequence then triggers state transitions in the monophone search network, and the first output-state score is calculated. Like the in-vocabulary word output-state score, the first output-state score characterizes the degree of match between the feature sequence and each monophone primitive in the monophone search network.
Step S113:It is when the first output state fraction is more than the first predetermined threshold value, next frame voice signal is special
Sign sequence inputs word search network in the single-tone element search network and the collection and synchronizes decoding respectively.
While the speech signal feature sequence propagates through the monophone search network, the decoder simultaneously computes the output-state scores of all output states and takes their maximum. In a concrete application, as each frame of the feature sequence propagates through the monophone search network, the output-state scores matching the feature sequence against all monophones are computed and collected, and the maximum among them is kept and output as the first output-state score to be compared with the first preset threshold. When the first output-state score of the current frame exceeds the first preset threshold, the in-vocabulary word search network is activated. After activation, the next frame of the feature sequence is input simultaneously into the monophone search network and the in-vocabulary word search network, which decode synchronously. By first feeding the feature sequence into the monophone search network, which matches all words, for decoding and recognition, and switching subsequent frames into both networks for synchronous decoding only once the recognition result meets the preset condition, synchronous decoding is cut in at a suitable moment, which helps improve both the efficiency and the accuracy of speech recognition.
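The gating described above can be sketched as a short loop. This is an illustrative simplification, not the patent's implementation: the function name, the list-based framing, and the convention that the returned index is the first synchronously decoded frame are all assumptions.

```python
def gated_activation(first_scores, first_threshold):
    """Check the monophone network's first output-state score each frame;
    once it exceeds the threshold, decoding becomes synchronous (both
    networks) from the NEXT frame onward."""
    for t, score in enumerate(first_scores):
        if score > first_threshold:
            return t + 1  # index of the first synchronously decoded frame
    return None  # threshold never met: in-vocabulary network stays inactive
```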
Further, Fig. 3 is a schematic flowchart, in an embodiment of the speech recognition method of the present application, of the steps of inputting the speech signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and performing synchronous decoding. As shown in Fig. 3, step S111 in this embodiment includes:
Step S111a: input the current frame of the speech signal feature sequence into the monophone search network.
When the speech signal feature sequence is input into the decoding model for token-passing decoding, only the monophone search network is activated at first, and the in-vocabulary word search network remains inactive by default. At this point, the feature sequence can only be input into the monophone search network for token-passing decoding.
Step S111b: obtain the joint probabilities of the current frame of the speech signal feature sequence and the basic units of the monophone search network.
While the speech signal feature sequence undergoes token-passing decoding in the monophone search network, the degree of match between each frame of the feature sequence and each basic unit of the monophone search network, i.e. the joint probability, is computed. The monophone search network is an information search network composed of all monophones: it contains every phoneme from which any word can be formed, and because monophones carry no context information, the monophone search network can match both in-vocabulary and out-of-vocabulary words. Its basic units are all the monophone models. In a specific implementation, each phoneme model, i.e. each basic unit, may be an HMM (Hidden Markov Model). An HMM models the time-series structure of the speech signal statistically and can be viewed mathematically as a doubly stochastic process: one process is a Markov chain with a finite number of states modeling the changing statistical properties of the speech signal, and it is hidden (the internal states of the Markov model are invisible to the outside); the other is the externally visible random process of the observation sequences associated with each state of the Markov chain (generally the acoustic features computed from each frame). The number of HMM states of each phoneme model is set to a phoneme onset stage, a phoneme steady stage and a phoneme ending stage, and the speech recognition process is the process of state transitions through the states of each phoneme model. Further, the joint probability of the HMM hidden state sequence and the corresponding current frame of the feature sequence is computed by the Viterbi algorithm; this joint probability characterizes the degree of match between the current frame of the feature sequence and each basic unit of the monophone search network, i.e. how likely the current frame is to be a given monophone.
Step S111c: take the maximum of the joint probabilities as the first output-state score.
After the degree of match, i.e. the joint probability, between each frame of the feature sequence and each basic unit of the monophone search network is obtained, the joint probabilities are compared, the maximum among them is determined, and this maximum joint probability is output as the first output-state score. Each frame of the feature sequence is matched against all monophones in the monophone search network, yielding multiple joint probabilities, and the matching path with the largest joint probability is selected for token passing. The first output-state score characterizes the joint probability of the complete decoding path propagated from the start of recognition up to the current state. For each propagation step, the joint probabilities of the feature sequence against all basic units are computed, but only the matching result of the state with the maximum joint probability is retained; this guarantees that, once all frames of the feature sequence have propagated, the first output-state score of the path with the maximum joint probability is obtained. The first output-state score is then compared with the first preset threshold to decide whether the next frame of the feature sequence is switched into synchronous decoding; cutting in synchronous decoding at a suitable moment helps improve the efficiency and accuracy of speech recognition.
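The max-selection step can be sketched in a few lines. The function name and the use of unit labels are illustrative assumptions; only the maximum score and its winning unit are retained, as the description above requires.

```python
import numpy as np

def first_output_state_score(joint_logprobs, unit_names):
    """Keep only the best-matching monophone unit: its joint (log-)probability
    becomes the first output-state score, and its path is the one retained
    for further token passing."""
    i = int(np.argmax(joint_logprobs))
    return unit_names[i], float(joint_logprobs[i])
```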
Further, step S105 can include:
Step 1: when the in-vocabulary word output-state score meets the preset condition, obtain a first transmission score of the synchronous decoding of the monophone search network and a second transmission score of the synchronous decoding of the in-vocabulary word search network.
When the decoding result of the speech signal feature sequence in the monophone search network meets the preset activation condition of the in-vocabulary word search network, the in-vocabulary word search network is activated for synchronous decoding; from then on, each subsequent frame of the feature sequence is input simultaneously into the monophone search network and the in-vocabulary word search network. While the in-vocabulary word search network decodes, the output-state score of the whole word is checked after the state transitions of each frame. Analogous to the first output-state score of the monophone search network, the output-state score produced by decoding in the in-vocabulary word search network characterizes the degree of match between the feature sequence and the in-vocabulary word search network. Further, this output-state score can be obtained by computing the joint probability of the feature sequence and the basic units of the in-vocabulary word search network. The in-vocabulary word search network is an information search network composed of triphones, whose basic units are triphones carrying inter-phoneme context information. Because its triphone units carry context information, the output-state score of a feature sequence belonging to an in-vocabulary word will tend to be higher in the in-vocabulary word search network than in the monophone search network. Conversely, when an incoming feature sequence belongs to an out-of-vocabulary word, even though the in-vocabulary word search network has been activated, its output-state score will tend to be lower than that of the monophone search network; and whether or not the feature sequence belongs to an in-vocabulary word has essentially no great influence on the monophone search network. Thus, when the input belongs to an out-of-vocabulary word, the monophone search network obtains the higher output-state score and transmission score compared with the in-vocabulary word search network; when the input belongs to an in-vocabulary word, the monophone search network obtains the lower output-state score and transmission score compared with the in-vocabulary word search network. In theory, only when the input speech signal belongs to an in-vocabulary word can the transmission score and output-state score obtained in the in-vocabulary word search network approach or even exceed the corresponding transmission score and output-state score of the monophone search network. When the output-state score of the in-vocabulary word search network meets the preset condition, for example exceeds a preset threshold, the second transmission score covering the span from entering to leaving the in-vocabulary word search network is computed, and then the first transmission score of the monophone search network over the same span, entered at the same time, is computed. Further, when the in-vocabulary word output-state score exceeds the set threshold, the output-state score and the historical transmission information of the output state are recorded; from the historical transmission information, the activation information of the transmission path can be found, from which the start frame of the whole word and the transmission score can be derived. In addition to the first output-state score of the monophone search network for the current frame, the transmission score of the monophone search network over the period during which the in-vocabulary word search network was active is computed. The transmission score may be the score passed during token passing: it records the output-state score over a certain section of the complete decoding path, where the section runs from entering the decoding model to leaving it, and it intuitively reflects the degree of match within that section of the decoding process.
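The token-passing bookkeeping described above can be sketched with a small data structure. This is a hedged sketch under stated assumptions: the class layout, method names, and the convention that scores accumulate additively in the log domain are all illustrative, not the patent's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """A token carries the accumulated output-state score plus the score it
    had when it entered the current sub-network, so the transmission score
    of that section can be recovered by subtraction. It also records
    (network, entry score) history for traceback."""
    total_score: float = 0.0
    entry_score: float = 0.0
    history: list = field(default_factory=list)

    def enter_network(self, name):
        self.entry_score = self.total_score
        self.history.append((name, self.total_score))

    def advance(self, frame_log_score):
        self.total_score += frame_log_score

    def section_score(self):
        """Transmission score of the section since the last network entry."""
        return self.total_score - self.entry_score
```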
Step 2: obtain the confidence according to the first transmission score and the second transmission score.
After the transmission scores of the synchronous decoding of the same span of the feature sequence in the monophone search network and the in-vocabulary word search network are obtained, the first transmission score of the monophone search network is taken as a reference score, and the confidence is derived by weighing the second transmission score of the in-vocabulary word search network against it. Further, the confidence can be defined as the ratio of the second transmission score to the first transmission score, which characterizes the matching credibility of each path of the input feature sequence.
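As a one-line sketch of the definition just given (the function name is illustrative; the example values assume negative log-domain scores):

```python
def confidence(second_transmission, first_transmission):
    """Confidence = second transmission score (in-vocabulary word network)
    divided by the first transmission score (monophone reference)."""
    return second_transmission / first_transmission
```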
Further, step S105 can also include:
Step 1: when the in-vocabulary word output-state score exceeds a second preset threshold, obtain the first transmission score and the second transmission score respectively through the network topology.
The obtained in-vocabulary word output-state score is compared with the second preset threshold; when it exceeds the second preset threshold, the first transmission score of the monophone search network and the second transmission score of the in-vocabulary word search network are obtained respectively through the network topology. Further, before the feature sequence enters the word network to be recognized, which comprises the monophone search network and the in-vocabulary word search network, a token is generated that records traceback information and the transmission score. When the output state of the word network to be recognized is reached, the output-state score is computed, and the transmission score within the word network to be recognized is then obtained by subtracting from this output-state score the entry-point score traced back through the token. What the transmission score essentially records is the output-state score over a certain section of the full path, where the section runs from entering the word network to be recognized to leaving it. Further, unlike the in-vocabulary word search network, the monophone search network only propagates the initialization token and generates no new tokens; instead, it can use the token traceback information of the in-vocabulary word search network to obtain the transmission score of the identical propagation span. Specifically, the first transmission score of the monophone search network is obtained by subtracting, from the maximum output-state score of the monophone search network over the same time period, the entry-point output-state score traced back by the token in the in-vocabulary word search network; this works because only a single best path is retained during the search of the monophone search network. Through the network topology, the first transmission score of the monophone search network's synchronous decoding and the second transmission score of the in-vocabulary word search network's synchronous decoding over the input feature sequence can thus be obtained.
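The subtraction just described can be sketched directly. All names here are illustrative assumptions; the point is that both networks' span scores are referenced to the same token entry-point score.

```python
def transmission_scores(mono_best_output, invocab_output, entry_score):
    """Recover both transmission scores for the identical span: each
    network's output-state score minus the entry-point score recorded by
    the in-vocabulary network's token. The monophone network keeps a single
    best path, so its current best output score is reused directly."""
    first = mono_best_output - entry_score    # monophone network span score
    second = invocab_output - entry_score     # in-vocabulary network span score
    return first, second
```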
Step 2: take the ratio of the second transmission score to the first transmission score as the confidence.
After the first and second transmission scores are obtained, the first transmission score of the monophone search network is taken as a reference score, and the ratio of the second transmission score to the first transmission score is defined as the confidence, which characterizes the matching credibility of each path of the input feature sequence through the monophone search network and the in-vocabulary word search network. With this definition, a smaller confidence value indicates higher matching credibility. Conversely, when the confidence is defined as the ratio of the first transmission score to the second transmission score, a larger confidence value indicates higher matching credibility. Defining the confidence by the ratio of the transmission scores of the monophone search network and the in-vocabulary word search network, and then using this confidence to select the recognition result, rejects the interference of most out-of-vocabulary words well and guarantees recognition accuracy.
Further, step S107 can include:
Step 1: obtain the number of frames of the speech signal feature sequence whose confidences meet the confidence threshold condition.
The traditional means of directly comparing confidences between different words and deciding the recognition result by the best confidence is not ideal, because the speech signal is in fact continuous: when a speech feature sequence matches a search network, it is not only a single frame that obtains a very high score, but multiple frames. In view of this continuity property, after the confidences are obtained, the number of frames whose confidence meets the confidence threshold condition is counted. Specifically, each decoding path meets the confidence threshold condition on a different number of frames, and this number is counted for each path. Each in-vocabulary word search network is independent, with its own phonemes and phoneme count, so a separate confidence threshold is set for each in-vocabulary word. When the confidence is defined as the ratio of the in-vocabulary word search network's transmission score to the monophone search network's transmission score, a frame above the threshold is considered an out-of-vocabulary pronunciation and rejected, while a frame below the threshold is retained for the integrated decision.
Step 2: output the decoding path corresponding to the maximum frame count to obtain the speech recognition result.
After the number of frames meeting the confidence threshold condition is obtained for each decoding path, recognition proceeds along the decoding path with the largest such count, and the speech recognition result is output. In this embodiment, the confidence computation is directly related to the number of phonemes an in-vocabulary word contains, so the best confidence computed for different words differs; therefore, directly comparing confidences between different words to decide the recognition result is not ideal. Given the inter-word differences in best confidence and the continuity of the speech signal, the number of output frames on which each word meets its confidence threshold is first obtained from decoding statistics, and the word with the largest count is finally taken as the recognition result by comparison. Selecting the recognition result in this way takes the continuity property of the speech signal into account and can effectively improve speech recognition accuracy.
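The decision rule above can be sketched as follows. This is an illustrative simplification under stated assumptions: the dict layout, function name, and the qualification predicate (confidence at or below the word's own threshold, per the ratio definition above) are all assumptions, and ties are broken arbitrarily.

```python
def pick_recognition_result(path_confidences, thresholds,
                            qualifies=lambda c, th: c <= th):
    """Each word (decoding path) has its own confidence threshold: count
    the frames on which the path's confidence qualifies, then return the
    word with the most qualifying frames, exploiting the frame-to-frame
    continuity of speech."""
    counts = {
        word: sum(1 for c in confs if qualifies(c, thresholds[word]))
        for word, confs in path_confidences.items()
    }
    return max(counts, key=counts.get), counts
```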
Further, Fig. 3 is a schematic flowchart, in an embodiment, of the steps preceding step S101. As shown in Fig. 3, the following steps precede step S101:
Step S101a: obtain a speech signal.
The speech signal can be obtained by a speech acquisition system. Specifically, the speech can be picked up by a sound pickup such as a microphone and then processed by an amplifier and a filter.
Step S101b: perform endpoint detection on the obtained speech signal to obtain the speech signal feature sequence.
The raw speech signal gathered directly by the speech acquisition system contains much unimportant information and background noise, so a series of preprocessing steps must be applied to the raw speech signal sequence through the endpoint detection module: endpoint detection (determining the start and end of the speech signal), pre-filtering (removing influences such as individual pronunciation differences and background noise), framing (the speech signal is approximately short-time stationary within 10-30 ms, so it is split into segments for analysis), windowing (so that the analysis methods of stationary processes can be applied), pre-emphasis (boosting the high-frequency part) and Fourier transformation (converting to a digital representation for easier processing). After preprocessing, the speech signal feature sequence to be input into the decoding model for token-passing decoding is obtained.
Further, Fig. 4 is a schematic flowchart of step S101b in an embodiment. As shown in Fig. 4, the input speech signal is first processed by framing, windowing and pre-emphasis, then undergoes a Fast Fourier Transform (FFT) and passes through triangular window filters, after which noise power estimation and smoothed power estimation are performed separately on the filtered output signal. During noise power estimation, the signal-to-noise ratio (SNR) is computed and compared with a threshold; if it does not exceed the threshold, the flow returns to the speech signal acquisition step. During smoothed power estimation, the difference between the actual power and the estimate is computed; when the difference is below a threshold, the flow likewise returns to the speech signal acquisition step. When the SNR from noise power estimation exceeds its threshold, and the difference between the actual power and the smoothed power estimate is not below its threshold, a discrete cosine transform (DCT) is applied to the signal, and the speech signal feature sequence is output.
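The core of the front end in Fig. 4 can be sketched as below. This is a hedged sketch, not the patent's pipeline: the triangular (mel) filter bank and the SNR/smoothed-power gating from the figure are omitted, and the frame/hop sizes (25 ms / 10 ms at a 16 kHz sample rate), pre-emphasis coefficient, and function names are illustrative assumptions.

```python
import numpy as np

def dct2(x):
    """Plain DCT-II, corresponding to the DCT block in Fig. 4."""
    n_pts = len(x)
    n = np.arange(n_pts)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * n_pts)))
                     for k in range(n_pts)])

def extract_features(signal, frame_len=400, hop=160, n_ceps=13):
    """Pre-emphasis, framing, Hamming window, FFT power spectrum, then a
    DCT of the log spectrum, keeping the first n_ceps coefficients per
    frame as the feature sequence."""
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = emphasized[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats.append(dct2(np.log(power + 1e-10))[:n_ceps])
    return np.array(feats)
```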
In addition, the present application also provides a speech recognition system. Fig. 5 is a schematic structural diagram of an embodiment of the speech recognition system of the present application. As shown in Fig. 5, a speech recognition system includes:
a synchronous decoding module 100, configured to input the speech signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and perform synchronous decoding;
a state score obtaining module 300, configured to obtain the in-vocabulary word output-state score obtained by the synchronous decoding;
a confidence obtaining module 500, configured to obtain, when the in-vocabulary word output-state score meets the preset condition, the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network;
a speech recognition output module 700, configured to output the speech recognition result according to the decoding path selected by the confidence.
In the above speech recognition system, the synchronous decoding module propagates the speech signal feature sequence through the monophone search network and the in-vocabulary word search network in synchronous decoding; when the in-vocabulary word output-state score obtained by the state score obtaining module from the decoding of the in-vocabulary word search network meets the preset condition, the confidence obtaining module obtains the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network; finally, the speech recognition output module outputs the speech recognition result according to the decoding path corresponding to the confidence. By simultaneously inputting the feature sequence into the monophone search network and the in-vocabulary word search network for decoding, in-vocabulary word recognition and out-of-vocabulary word rejection are effectively distinguished, guaranteeing recognition accuracy; and selecting the decoding path according to the confidence to obtain the recognition result further improves speech recognition accuracy.
Further, a computer device is also provided, including a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements any one of the speech recognition methods in the embodiments described above.
When the processor of the computer device executes the program, it implements any one of the speech recognition methods of the embodiments described above: the speech signal feature sequence is propagated through the monophone search network and the in-vocabulary word search network respectively in synchronous decoding; when the in-vocabulary word output-state score obtained by decoding the in-vocabulary word search network meets the preset condition, the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network is obtained; finally, the speech recognition result is output according to the decoding path corresponding to the confidence. By simultaneously inputting the feature sequence into both networks for decoding, in-vocabulary word recognition and out-of-vocabulary word rejection are effectively distinguished, guaranteeing recognition accuracy; and selecting the decoding path according to the confidence to obtain the recognition result further improves speech recognition accuracy.
In addition, those of ordinary skill in the art will appreciate that all or part of the flows in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium. In the embodiments of the present application, the program can be stored in the storage medium of a computer system and executed by at least one processor in the computer system, to realize the flows including the embodiments of each speech recognition method described above.
Further, a storage medium is also provided, on which a computer program is stored, wherein the program, when executed by a processor, implements any one of the speech recognition methods in the embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as no contradiction exists in a combination of these technical features, it shall be considered within the scope recorded in this specification.
The above embodiments express only several implementations of the present application, and their description is comparatively specific and detailed, but they shall not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the conception of the present application, and these all belong to the protection scope of the present application. Therefore, the protection scope of the present application patent shall be determined by the appended claims.
Claims (10)
- 1. A speech recognition method, characterized by comprising: inputting a speech signal feature sequence into a monophone search network and an in-vocabulary word search network respectively, and performing synchronous decoding; obtaining an in-vocabulary word output-state score obtained by the synchronous decoding; when the in-vocabulary word output-state score meets a preset condition, obtaining a confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network; and outputting a speech recognition result according to a decoding path selected by the confidence.
- 2. The speech recognition method according to claim 1, characterized in that the step of inputting the speech signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and performing synchronous decoding comprises: inputting a current frame of the speech signal feature sequence into the monophone search network and obtaining a first output-state score; and when the first output-state score exceeds a first preset threshold, inputting a next frame of the speech signal feature sequence into the monophone search network and the in-vocabulary word search network respectively for synchronous decoding.
- 3. The speech recognition method according to claim 2, characterized in that the step of inputting the current frame of the speech signal feature sequence into the monophone search network and obtaining the first output-state score comprises: inputting the current frame of the speech signal feature sequence into the monophone search network; obtaining joint probabilities of the current frame of the speech signal feature sequence and the basic units of the monophone search network; and taking the maximum of the joint probabilities as the first output-state score.
- 4. The speech recognition method according to claim 1, characterized in that the step of obtaining, when the in-vocabulary word output-state score meets the preset condition, the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network comprises: when the in-vocabulary word output-state score meets the preset condition, obtaining a first transmission score of the synchronous decoding of the monophone search network and a second transmission score of the synchronous decoding of the in-vocabulary word search network; and obtaining the confidence according to the first transmission score and the second transmission score.
- 5. The speech recognition method according to claim 4, characterized in that the step of obtaining, when the in-vocabulary word output-state score meets the preset condition, the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network comprises: when the in-vocabulary word output-state score exceeds a second preset threshold, obtaining the first transmission score and the second transmission score respectively through a network topology; and taking the ratio of the second transmission score to the first transmission score as the confidence.
- 6. The speech recognition method according to claim 1, characterized in that the step of outputting the speech recognition result according to the decoding path selected by the confidence comprises: obtaining the number of frames of the speech signal feature sequence whose confidences meet a confidence threshold condition; and outputting, according to the decoding path corresponding to the maximum frame count, the speech recognition result.
- 7. The speech recognition method according to claim 1, characterized in that, before the step of inputting the speech signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and performing synchronous decoding, the method comprises: obtaining a speech signal; and performing endpoint detection on the obtained speech signal to obtain the speech signal feature sequence.
- 8. A speech recognition system, characterized by comprising: a synchronous decoding module, configured to input a speech signal feature sequence into a monophone search network and an in-vocabulary word search network respectively and perform synchronous decoding; a state score obtaining module, configured to obtain an in-vocabulary word output-state score obtained by the synchronous decoding; a confidence obtaining module, configured to obtain, when the in-vocabulary word output-state score meets a preset condition, a confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network; and a speech recognition output module, configured to output a speech recognition result according to a decoding path selected by the confidence.
- 9. A computer device, comprising a memory, a processor and a computer program stored on the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the speech recognition method according to any one of claims 1 to 7.
- 10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711031665.9A CN107871499B (en) | 2017-10-27 | 2017-10-27 | Speech recognition method, system, computer device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711031665.9A CN107871499B (en) | 2017-10-27 | 2017-10-27 | Speech recognition method, system, computer device and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107871499A true CN107871499A (en) | 2018-04-03 |
CN107871499B CN107871499B (en) | 2020-06-16 |
Family
ID=61753362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711031665.9A Active CN107871499B (en) | 2017-10-27 | 2017-10-27 | Speech recognition method, system, computer device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107871499B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109273007A (en) * | 2018-10-11 | 2019-01-25 | 科大讯飞股份有限公司 | Voice awakening method and device |
CN110111775A (en) * | 2019-05-17 | 2019-08-09 | 腾讯科技(深圳)有限公司 | A kind of Streaming voice recognition methods, device, equipment and storage medium |
WO2019214361A1 (en) * | 2018-05-08 | 2019-11-14 | 腾讯科技(深圳)有限公司 | Method for detecting key term in speech signal, device, terminal, and storage medium |
CN111862943A (en) * | 2019-04-30 | 2020-10-30 | 北京地平线机器人技术研发有限公司 | Speech recognition method and apparatus, electronic device, and storage medium |
CN112331219A (en) * | 2020-11-05 | 2021-02-05 | 北京爱数智慧科技有限公司 | Voice processing method and device |
CN112652306A (en) * | 2020-12-29 | 2021-04-13 | 珠海市杰理科技股份有限公司 | Voice wake-up method and device, computer equipment and storage medium |
CN114783438A (en) * | 2022-06-17 | 2022-07-22 | 深圳市友杰智新科技有限公司 | Adaptive decoding method, apparatus, computer device and storage medium |
CN115831100A (en) * | 2023-02-22 | 2023-03-21 | 深圳市友杰智新科技有限公司 | Voice command word recognition method, device, equipment and storage medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100679051B1 (en) * | 2005-12-14 | 2007-02-05 | 삼성전자주식회사 | Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms |
CN101030369B (en) * | 2007-03-30 | 2011-06-29 | 清华大学 | Built-in speech discriminating method based on sub-word hidden Markov model |
CN101604520A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | Spoken language voice recognition method based on statistical model and syntax rule |
CN101763855B (en) * | 2009-11-20 | 2012-01-04 | 安徽科大讯飞信息科技股份有限公司 | Method and device for judging confidence of speech recognition |
CN105321518B (en) * | 2014-08-05 | 2018-12-04 | 中国科学院声学研究所 | A kind of rejection method for identifying of low-resource Embedded Speech Recognition System |
CN105161096B (en) * | 2015-09-22 | 2017-05-10 | 百度在线网络技术(北京)有限公司 | Speech recognition processing method and device based on garbage models |
CN106683677B (en) * | 2015-11-06 | 2021-11-12 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN106782513B (en) * | 2017-01-25 | 2019-08-23 | 上海交通大学 | Speech recognition realization method and system based on confidence level |
- 2017-10-27 CN CN201711031665.9A patent/CN107871499B/en active Active
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11341957B2 (en) | 2018-05-08 | 2022-05-24 | Tencent Technology (Shenzhen) Company Limited | Method for detecting keyword in speech signal, terminal, and storage medium |
WO2019214361A1 (en) * | 2018-05-08 | 2019-11-14 | 腾讯科技(深圳)有限公司 | Method for detecting key term in speech signal, device, terminal, and storage medium |
CN109273007B (en) * | 2018-10-11 | 2022-05-17 | 西安讯飞超脑信息科技有限公司 | Voice wake-up method and device |
CN109273007A (en) * | 2018-10-11 | 2019-01-25 | 科大讯飞股份有限公司 | Voice awakening method and device |
CN111862943A (en) * | 2019-04-30 | 2020-10-30 | 北京地平线机器人技术研发有限公司 | Speech recognition method and apparatus, electronic device, and storage medium |
CN111862943B (en) * | 2019-04-30 | 2023-07-25 | 北京地平线机器人技术研发有限公司 | Speech recognition method and device, electronic equipment and storage medium |
CN110111775A (en) * | 2019-05-17 | 2019-08-09 | 腾讯科技(深圳)有限公司 | A kind of Streaming voice recognition methods, device, equipment and storage medium |
CN112331219A (en) * | 2020-11-05 | 2021-02-05 | 北京爱数智慧科技有限公司 | Voice processing method and device |
CN112331219B (en) * | 2020-11-05 | 2024-05-03 | 北京晴数智慧科技有限公司 | Voice processing method and device |
CN112652306A (en) * | 2020-12-29 | 2021-04-13 | 珠海市杰理科技股份有限公司 | Voice wake-up method and device, computer equipment and storage medium |
CN112652306B (en) * | 2020-12-29 | 2023-10-03 | 珠海市杰理科技股份有限公司 | Voice wakeup method, voice wakeup device, computer equipment and storage medium |
CN114783438A (en) * | 2022-06-17 | 2022-07-22 | 深圳市友杰智新科技有限公司 | Adaptive decoding method, apparatus, computer device and storage medium |
CN114783438B (en) * | 2022-06-17 | 2022-09-27 | 深圳市友杰智新科技有限公司 | Adaptive decoding method, apparatus, computer device and storage medium |
CN115831100A (en) * | 2023-02-22 | 2023-03-21 | 深圳市友杰智新科技有限公司 | Voice command word recognition method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107871499B (en) | 2020-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107871499A (en) | Speech recognition method, system, computer device and computer-readable storage medium | |
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
JP4195428B2 (en) | Speech recognition using multiple speech features | |
CN101930735B (en) | Speech emotion recognition equipment and speech emotion recognition method | |
CN102723078B (en) | Emotion speech recognition method based on natural language comprehension | |
CN107369439B (en) | Voice awakening method and device | |
CN105206258A (en) | Generation method and device of acoustic model as well as voice synthetic method and device | |
CN105895078A (en) | Speech recognition method used for dynamically selecting speech model and device | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
Brandes | Feature vector selection and use with hidden Markov models to identify frequency-modulated bioacoustic signals amidst noise | |
CN111105785B (en) | Text prosody boundary recognition method and device | |
CN110265063B (en) | Lie detection method based on fixed duration speech emotion recognition sequence analysis | |
CN112967725A (en) | Voice conversation data processing method and device, computer equipment and storage medium | |
CN112735477B (en) | Voice emotion analysis method and device | |
CN116580706B (en) | Speech recognition method based on artificial intelligence | |
CN109065073A (en) | Speech-emotion recognition method based on depth S VM network model | |
CN111341319A (en) | Audio scene recognition method and system based on local texture features | |
CN111883181A (en) | Audio detection method and device, storage medium and electronic device | |
Kharamat et al. | Durian ripeness classification from the knocking sounds using convolutional neural network | |
CN115171731A (en) | Emotion category determination method, device and equipment and readable storage medium | |
CN108364655A (en) | Method of speech processing, medium, device and computing device | |
Ling | An acoustic model for English speech recognition based on deep learning | |
CN118136022A (en) | Intelligent voice recognition system and method | |
CN102141812A (en) | Robot | |
CN111402887A (en) | Method and device for escaping characters by voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder | ||
Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province
Patentee after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.
Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province
Patentee before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.