CN107871499A - Speech recognition method, system, computer device and computer-readable storage medium - Google Patents

Speech recognition method, system, computer device and computer-readable storage medium

Info

Publication number
CN107871499A
Authority
CN
China
Prior art keywords
vocabulary set
word
search network
score
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711031665.9A
Other languages
Chinese (zh)
Other versions
CN107871499B (en)
Inventor
秦浩然
肖全之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Jieli Technology Co Ltd
Priority to CN201711031665.9A
Publication of CN107871499A
Application granted
Publication of CN107871499B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/05 Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/083 Recognition networks
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/69 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units


Abstract

The present application relates to a speech recognition method, system, computer device and storage medium. The method comprises: inputting a speech-signal feature sequence into a monophone search network and an in-vocabulary word search network and decoding them synchronously; obtaining the in-vocabulary word output-state score produced by the in-vocabulary network's decoding; when the in-vocabulary word output-state score meets a preset condition, obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network; and selecting the corresponding decoding path according to the confidence and outputting the speech recognition result. With the above speech recognition method, system, computer device and computer-readable storage medium, propagating the feature sequence through the monophone search network and the in-vocabulary word search network simultaneously effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.

Description

Speech recognition method, system, computer device and computer-readable storage medium
Technical field
The present application relates to the technical field of speech recognition, and in particular to a speech recognition method, system, computer device and computer-readable storage medium.
Background
With the rapid development and application of computer technology, enabling machines to communicate through speech has become an important direction of artificial intelligence and machine learning. Speech recognition is the technology that lets a machine convert a speech signal into the corresponding text or command through a process of recognition and understanding. Current applications of speech recognition fall broadly into two directions: one is large-vocabulary continuous speech recognition systems, applied to phone assistants, voice dictation and the like; the other is the development of small-vocabulary portable speech products, such as intelligent toys and household remote controls.
The small-vocabulary speech recognition systems of the second direction are gradually being applied in fields such as handheld terminals and household appliances. Because they target a small vocabulary, they must cope not only with the noise interference that also affects the first kind of system, but also with the interference of a large number of out-of-vocabulary words: the system must correctly recognize in-vocabulary words while rejecting out-of-vocabulary ones. Yet the practical performance of traditional small-vocabulary systems remains unsatisfactory; for example, they cannot effectively combine in-vocabulary command-word recognition with out-of-vocabulary rejection, and their recognition accuracy is low.
Summary of the invention
In view of the above problems, it is necessary to provide a speech recognition method, system, computer device and computer-readable storage medium that can effectively achieve in-vocabulary word recognition and out-of-vocabulary word rejection and improve recognition accuracy.
A speech recognition method, comprising:
inputting a speech-signal feature sequence into a monophone search network and an in-vocabulary word search network respectively, and decoding them synchronously;
obtaining the in-vocabulary word output-state score produced by the synchronous decoding;
when the in-vocabulary word output-state score meets a preset condition, obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network;
selecting the corresponding decoding path according to the confidence, and outputting the speech recognition result.
In one embodiment, the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding them synchronously comprises:
inputting the current frame of the speech-signal feature sequence into the monophone search network to obtain a first output-state score;
when the first output-state score exceeds a first preset threshold, inputting the next frame of the speech-signal feature sequence into both the monophone search network and the in-vocabulary word search network and decoding them synchronously.
In one embodiment, the step of inputting the current frame of the speech-signal feature sequence into the monophone search network to obtain the first output-state score comprises:
inputting the current frame of the speech-signal feature sequence into the monophone search network;
obtaining the joint probabilities of the current frame of the feature sequence with the primitives of the monophone search network;
taking the maximum of the joint probabilities as the first output-state score.
In one embodiment, the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network when the in-vocabulary word output-state score meets the preset condition comprises:
when the in-vocabulary word output-state score meets the preset condition, obtaining a first propagation score of the monophone search network's synchronous decoding and a second propagation score of the in-vocabulary word search network's synchronous decoding;
obtaining the confidence from the first propagation score and the second propagation score.
In one embodiment, the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network when the in-vocabulary word output-state score meets the preset condition comprises:
when the in-vocabulary word output-state score exceeds a second preset threshold, obtaining the first propagation score and the second propagation score respectively through the network topologies;
taking the ratio of the second propagation score to the first propagation score as the confidence.
In one embodiment, the step of selecting the corresponding decoding path according to the confidence and outputting the speech recognition result comprises:
obtaining the frame counts of the speech-signal feature sequence corresponding to confidences that meet a confidence-threshold condition;
outputting according to the decoding path corresponding to the largest frame count, to obtain the speech recognition result.
In one embodiment, before the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding them synchronously, the method comprises:
acquiring a speech signal;
performing endpoint detection on the acquired speech signal to obtain the speech-signal feature sequence.
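A common baseline for the endpoint-detection step above is short-time-energy thresholding. The sketch below illustrates that idea only; the frame length, energy ratio, and function name are assumptions for illustration, not the patent's specific detector:

```python
import numpy as np

def detect_endpoints(signal, frame_len=160, energy_ratio=4.0):
    """Minimal energy-based endpoint detection: a frame is marked as
    speech when its short-time energy exceeds a multiple of the quietest
    frame's energy. Returns (first, last) speech frame indices, or None
    if no frame qualifies."""
    n = len(signal) // frame_len
    frames = np.reshape(signal[:n * frame_len], (n, frame_len))
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    thresh = energy.min() * energy_ratio + 1e-12
    speech = np.nonzero(energy > thresh)[0]
    if speech.size == 0:
        return None
    return int(speech[0]), int(speech[-1])
```

In a real front end the detected region, not the whole signal, would then be passed on to feature extraction.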
A speech recognition system, comprising:
a synchronous decoding module, configured to input a speech-signal feature sequence into a monophone search network and an in-vocabulary word search network respectively and decode them synchronously;
a state-score acquisition module, configured to obtain the in-vocabulary word output-state score produced by the synchronous decoding;
a confidence acquisition module, configured to obtain the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network when the in-vocabulary word output-state score meets a preset condition;
a speech recognition output module, configured to select the corresponding decoding path according to the confidence and output the speech recognition result.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech recognition method described above when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the program implements the speech recognition method described above when executed by a processor.
With the above speech recognition method, system, computer device and computer-readable storage medium, the speech-signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-vocabulary word search network; when the in-vocabulary word output-state score obtained by the in-vocabulary network's decoding meets the preset condition, the confidence of the two networks' synchronous decoding is obtained; finally, the decoding path corresponding to that confidence is selected and the speech recognition result is output. Feeding the feature sequence into both networks simultaneously for decoding and propagation effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of an embodiment of the speech recognition method of the present application;
Fig. 2 is a flow diagram, in an embodiment of the method, of the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding them synchronously;
Fig. 3 is a flow diagram, in an embodiment of the method, of the sub-steps of inputting the current frame of the speech-signal feature sequence into the monophone search network;
Fig. 4 is a flow diagram, in an embodiment of the method, of the steps performed before inputting the speech-signal feature sequence into the two networks and decoding them synchronously;
Fig. 5 is a flow diagram of endpoint detection in an embodiment of the speech recognition method of the present application;
Fig. 6 is a structural diagram of an embodiment of the speech recognition system of the present application.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to explain the application and do not limit its scope of protection.
Fig. 1 is a flow diagram of an embodiment of the speech recognition method of the present application. As shown in Fig. 1, the method of this embodiment comprises:
Step S101: input the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively, and decode them synchronously.
Speech is a kind of sound produced by the human vocal organs; it is an analog signal that has grammar and meaning and carries specific information. Because the speech signal is an analog quantity, processing it first requires converting it into a digital signal through sampling and quantization, where the sampling frequency must satisfy the Nyquist sampling theorem: it must exceed twice the highest frequency of the speech signal to be sampled. In addition, the speech signal contains a great deal of irrelevant information, such as background noise and emotion, so speech recognition has come to rely on speech-signal feature parameters. The basic idea of feature extraction is to remove the redundant part of the pre-processed signal through a linear transform, extract feature parameters that represent the essence of the speech, and then perform recognition on those parameters. Before feature extraction, the raw speech-signal sequence undergoes a series of pre-processing steps in the endpoint-detection module, such as framing, windowing, pre-emphasis, and Fourier transformation. The feature parameters of the speech signal include time-domain parameters, such as short-time average energy and pitch period, and frequency-domain parameters, such as the short-time spectrum and the first three formants. The feature most commonly used in speech recognition is the Mel-frequency cepstral coefficients (MFCC), cepstral parameters extracted in the mel-scale frequency domain; the mel scale describes the nonlinear frequency response of the human ear, and MFCCs can be used to extract the feature sequence of the speech signal.
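The MFCC chain just described (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log compression, DCT) can be sketched in NumPy roughly as follows. All parameter values and the function name are illustrative defaults, not taken from the patent:

```python
import numpy as np

def mfcc_features(signal, sample_rate=16000, frame_len=400, hop=160,
                  n_fft=512, n_filters=26, n_ceps=13):
    """Toy MFCC pipeline for a signal at least one frame long."""
    # Pre-emphasis boosts the high-frequency part of the spectrum.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, evenly spaced on the mel scale.
    hz_to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    mel_to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstra.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return log_energy @ dct.T
```

The returned array has one row of cepstral coefficients per frame, which would serve as the "speech-signal feature sequence" fed to the search networks.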
The monophone search network is a keyword search network whose primitives are all the single phonemes that can form any word. It can be used to trigger the in-vocabulary word search network and, decoding together with it, to perform recognition and rejection. A phoneme is the smallest speech unit distinguished according to the natural attributes of speech; each articulatory action forms one phoneme. For example, "ba" contains the two articulatory actions "b" and "a" and is thus two monophones. The actual sound (waveform) of a word, however, depends on many factors beyond the phonemes alone, such as the phoneme's context, the speaker, and the speaking style. Taking these factors into account, a single phoneme can be considered within its context, giving triphones or, more generally, polyphones. The in-vocabulary word search network is a keyword search network built from triphone primitives; a triphone carries the contextual information between phonemes, and the network is used to search the speech-signal feature sequence for in-vocabulary command words. The feature sequence extracted from the speech signal is fed into both the monophone search network and the in-vocabulary word search network at once and decoded synchronously: during synchronous decoding, a synchronization signal keeps the feature sequence searching, decoding, and propagating state in both networks simultaneously. Decoding against all monophones and against the in-vocabulary words at the same time effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy.
Step S103: obtain the in-vocabulary word output-state score produced by the synchronous decoding.
During synchronous decoding, the feature sequence propagates through the monophone search network and the in-vocabulary word search network simultaneously and is searched and decoded in both. When the in-vocabulary network searches and decodes, the output-state score of each whole word is computed after the state transition of every frame of the feature sequence. The in-vocabulary word output-state score is the matching probability of the input feature sequence with the triphone primitives of the in-vocabulary network: it characterizes how well the feature sequence matches each primitive, and a larger value indicates a higher degree of match, that is, a higher likelihood that the feature sequence corresponds to that primitive. More specifically, each primitive of the in-vocabulary network can be a hidden Markov model, and the in-vocabulary output score is then the joint probability of the hidden-state sequence and the corresponding feature sequence computed with the Viterbi algorithm. The Viterbi algorithm is a dynamic-programming algorithm for finding the most likely sequence of hidden states, the Viterbi path, that produces a sequence of observed events; it is particularly applicable to Markov information sources and hidden Markov models. Computing the in-vocabulary output-state score reveals how well the feature sequence matches during decoding in the in-vocabulary word search network.
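The Viterbi computation mentioned above can be sketched in the log domain as follows. The function returns the joint log-probability of the best hidden-state path together with the observations, the quantity the passage calls an output-state score; the model parameters in the usage are purely illustrative:

```python
import numpy as np

def viterbi_score(log_init, log_trans, log_emit):
    """Log-domain Viterbi. log_emit[t, s] is the emission log-likelihood
    of frame t in state s. Returns (best joint log-probability, path)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: best-via-i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return float(np.max(delta)), path[::-1]
```

In a decoder the recursion runs frame by frame as the features arrive, so the score can be checked after every state transition, as the text describes.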
Step S105: when the in-vocabulary word output-state score meets the preset condition, obtain the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network.
When the in-vocabulary word output-state score meets the preset condition, for example when it exceeds a preset threshold, the confidence of the feature sequence's synchronous decoding in the monophone search network and the in-vocabulary word search network is obtained. The preset condition can be set according to the user's needs, such as the required recognition precision: when set as a threshold, a higher value is chosen where high precision is required, and the confidence is obtained once the in-vocabulary output-state score exceeds it. The monophone search network is composed of monophone primitives and the in-vocabulary word search network of triphone primitives. When the input features belong to an in-vocabulary word, they match the in-vocabulary network well, and because that network uses triphone modeling (which carries contextual information), its output-state score will be higher than, or very close to, the propagation score obtained by the unconstrained monophone search network. When the input feature sequence is an out-of-vocabulary word, the monophone network, which contains all the monophones that can form any word, still matches it well, but the in-vocabulary network, built only from the triphones of the predetermined in-vocabulary words, yields a poor output-state score. This relationship between the two scores is exactly the confidence to be measured. In this embodiment, the confidence of the synchronous decoding of the monophone search network and the in-vocabulary word search network can specifically be taken as the inverse of the ratio of the monophone network's propagation score to the in-vocabulary network's, which directly characterizes the decoding measurements of the two networks and can effectively reject the interference of most out-of-vocabulary words.
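A minimal sketch of this confidence and its gating, assuming the scores are accumulated log-likelihoods (so both are negative) and using illustrative threshold values that are not from the patent:

```python
def confidence(in_set_score, monophone_score):
    """Ratio of the in-vocabulary network's accumulated path score to the
    monophone network's. With negative log-likelihoods, a ratio near 1.0
    from below means the constrained in-vocabulary network matched the
    audio about as well as the unconstrained phone loop."""
    return in_set_score / monophone_score

def is_in_vocabulary(in_set_score, monophone_score,
                     gate=-500.0, max_ratio=1.05):
    """Gate from the text: only once the in-vocabulary output-state score
    clears a preset threshold is the confidence computed and tested.
    Both threshold values here are assumptions for illustration."""
    if in_set_score < gate:      # preset condition not met: reject early
        return False
    return confidence(in_set_score, monophone_score) <= max_ratio
```

The exact form of the ratio and the comparison direction depend on the score convention; the point of the sketch is the two-level test (score gate, then score ratio).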
Step S107: select the corresponding decoding path according to the confidence, and output the speech recognition result.
After the confidence of the monophone search network and the in-vocabulary word search network is computed, the corresponding decoding path is selected according to the confidence to decode the feature sequence, and the speech recognition result is output. Further, considering the continuity of the speech signal, the optimal decoding path, that is, the path with the highest decoding match, can be selected according to the confidence and output as the recognition result. More specifically, after the confidences are obtained, the number of frames for which each word meets the confidence threshold can be counted, and the decoding path that satisfies the threshold on the most frames is decided to be the optimal decoding path, from which the recognition result is decoded and output.
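The frame-voting decision just described might look like the following; the data layout (word mapped to per-frame confidences) and the threshold predicate are assumptions for illustration:

```python
def pick_result(frame_confidences, meets_threshold):
    """Count, for every candidate word, the number of frames whose
    confidence satisfied the threshold test, and return the word whose
    decoding path satisfied it on the most frames (None if no frame of
    any word did)."""
    votes = {word: sum(1 for c in confs if meets_threshold(c))
             for word, confs in frame_confidences.items()}
    best = max(votes, key=votes.get, default=None)
    return best if best is not None and votes[best] > 0 else None
```

Passing the threshold test as a predicate keeps the voting rule independent of whether higher or lower confidence values mean a better match.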
With the above speech recognition method, the speech-signal feature sequence is decoded and propagated synchronously through the monophone search network and the in-vocabulary word search network; when the in-vocabulary word output-state score obtained by the in-vocabulary network's decoding meets the preset condition, the confidence of the two networks' synchronous decoding is obtained; finally, the decoding path corresponding to that confidence is selected and the speech recognition result is output. Feeding the feature sequence into both networks simultaneously for decoding and propagation effectively achieves in-vocabulary word recognition and out-of-vocabulary word rejection, ensuring recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
Further, Fig. 2 is a flow diagram, in an embodiment of the speech recognition method of the present application, of the step of inputting the speech-signal feature sequence into the monophone search network and the in-vocabulary word search network respectively and decoding them synchronously. As shown in Fig. 2, step S101 specifically comprises:
Step S111: input the current frame of the speech-signal feature sequence into the monophone search network to obtain a first output-state score.
The speech signal to be detected is processed by the endpoint-detection module to obtain each frame of the feature sequence to be detected. When a segment of the feature sequence is fed into the decoding model, only the monophone search network is activated by default: the current frame of the feature sequence is first propagated and decoded in the monophone network, while the in-vocabulary word search network remains inactive by default. When the feature sequence enters the monophone network, the first frame activates all the monophone models, that is, the primitive models, in the network, not just the silence phoneme. Each frame of the feature sequence then triggers state transitions of the monophone models, and the first output-state score is computed. Like the in-vocabulary word output-state score, the first output-state score characterizes how well the feature sequence matches each monophone primitive in the monophone search network.
Step S113: when the first output-state score exceeds the first preset threshold, input the next frame of the speech-signal feature sequence into both the monophone search network and the in-vocabulary word search network and decode them synchronously.
While the feature sequence is decoded and propagated in the monophone search network, the maximum over the output-state scores of all output states is computed simultaneously. In practice, as each frame of the feature sequence is decoded and propagated in the monophone network, the output-state scores of the frame matched against all monophones are computed and collected, and the maximum among them is kept and output as the first output-state score to be compared with the first preset threshold. When the first output-state score of the current frame exceeds the first preset threshold, the in-vocabulary word search network is activated, and from the next frame on the feature sequence is input into both the monophone network and the in-vocabulary network and decoded synchronously. Feeding the feature sequence first into the monophone search network, which matches all words, and switching the subsequent frames into synchronous decoding only once the result meets the preset condition lets the system cut in at a suitable moment, which helps improve both the efficiency and the accuracy of speech recognition.
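The two-stage activation scheme above can be sketched as the following control loop; `mono_step` and `in_set_step` stand in for one frame of decoding propagation in each network, and all names and values are assumptions of this sketch:

```python
def two_stage_decode(frames, mono_step, in_set_step, activate_thresh):
    """Run only the monophone loop at first; the frame whose best
    monophone score clears `activate_thresh` switches on the in-set
    word network, and both networks are stepped in lockstep from the
    next frame on. Returns the per-frame (mono, in_set) score pairs,
    with None while the in-set network is still inactive."""
    in_set_active = False
    history = []
    for frame in frames:
        mono_score = mono_step(frame)        # best joint score this frame
        in_set_score = in_set_step(frame) if in_set_active else None
        history.append((mono_score, in_set_score))
        if not in_set_active and mono_score > activate_thresh:
            in_set_active = True             # next frame decodes both
    return history
```

The design choice to gate activation on the monophone score saves the cost of stepping the triphone network during leading silence or noise.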
Further, Fig. 3 is a flow diagram, in an embodiment of the speech recognition method of the present application, of the sub-steps of inputting the current frame of the speech-signal feature sequence into the monophone search network. As shown in Fig. 3, step S111 in this embodiment comprises:
Step S111a: input the current frame of the speech-signal feature sequence into the monophone search network.
When the feature sequence is fed into the decoding model for propagation and decoding, only the monophone search network is activated at first, and the in-vocabulary word search network remains inactive by default. At this point the feature sequence can only be propagated and decoded in the monophone search network.
Step S111b: obtain the joint probabilities of the current frame of the feature sequence with the primitives of the monophone search network.
As the feature sequence is propagated and decoded in the monophone search network, the matching degree, that is, the joint probability, of each frame with each primitive of the network is computed. The monophone search network is a keyword search network composed of all monophones: it contains every phoneme that can form any word, and since monophones carry no contextual information, the network matches arbitrary in-vocabulary and out-of-vocabulary words alike; its primitives are all the monophone models. In a specific implementation, each phoneme model, that is, primitive, can be an HMM (hidden Markov model). An HMM builds a statistical model of the time-series structure of the speech signal, treating it mathematically as a doubly stochastic process: one process is a Markov chain with a finite number of states that simulates the changing statistics of the speech signal implicitly (the internal states of the Markov model are not externally visible); the other is the externally visible stochastic process of observation sequences associated with each state of the chain (generally the acoustic features computed from each frame). The HMM of each phoneme model is configured with states for the phoneme's onset, steady portion, and ending, and the recognition process is the process of transitioning through the states of the phoneme models. Further, the joint probability of the HMM hidden-state sequence and the corresponding current frame of the feature sequence is computed by the Viterbi algorithm; this joint probability characterizes the matching degree of the current frame with each primitive of the monophone network, that is, how likely the current frame is to be a given monophone.
Step S111c: take the maximum of the joint probabilities as the first output-state score.
After the matching degree (joint probability) between each frame of the feature sequence and every primitive of the monophone search network is obtained, the joint probabilities are compared, the maximum among them is found, and that maximum joint probability is output as the first output-state score. Each frame of the feature sequence can be matched against all monophones in the monophone search network, yielding multiple joint probabilities, and the matching path with the maximum joint probability is selected to continue the propagation. The first output-state score therefore characterizes the joint probability of the entire decoding path from the start of recognition up to the current state. In each propagation step the feature sequence is matched against all primitives, but only the matching result of the state with the maximum joint probability is retained, so after all frames have been propagated, the first output-state score of the path with the maximum joint probability is guaranteed to be obtained. This first output-state score is then compared with a first preset threshold to decide whether to switch the next frame of the feature sequence into synchronous decoding; switching at a suitable moment helps improve both the efficiency and the accuracy of speech recognition.
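The gating decision above can be sketched as follows; the network objects, their `step` method, and the threshold value are assumptions for illustration only:

```python
def feed_frame(frame, mono_net, inset_net, inset_active, first_threshold):
    """Feed one frame; report whether the in-set word network should be
    active for the NEXT frame, i.e. once the monophone network's first
    output-state score exceeds the first preset threshold."""
    first_score = mono_net.step(frame)   # best joint probability so far
    if inset_active:
        inset_net.step(frame)            # synchronous decoding of this frame
    return inset_active or first_score > first_threshold
```

Once activated, every subsequent frame is fed into both networks, which is the synchronous decoding described above.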
Further, step S105 can include:
Step 1: when the in-set word output-state score meets the preset condition, obtain the first transmission score of the synchronous decoding of the monophone search network and the second transmission score of the synchronous decoding of the in-set word search network.
When the decoding result of the feature sequence in the monophone search network meets the preset activation condition of the in-set word search network, the in-set word search network is activated and decodes synchronously; from then on, each next frame of the feature sequence is input simultaneously into the monophone search network and the in-set word search network. While the in-set word search network decodes, the output-state score of the whole word is checked after the state transition of every frame. Analogously to the first output-state score of the monophone search network, the output-state score obtained while decoding in the in-set word search network characterizes the matching degree between the feature sequence and the in-set word search network; further, it can be obtained by computing the joint probability of the feature sequence and the primitives of the in-set word search network. The in-set word search network is a search network composed of triphones, so its primitives are triphones carrying contextual information between phonemes. Because these triphone primitives contain context, the output-state score of a feature sequence belonging to an in-set word tends to be higher in the in-set word search network than in the monophone search network. Conversely, when the incoming feature sequence belongs to an out-of-set word, its output-state score in the in-set word search network, even once activated, is comparatively low relative to the monophone search network, while whether the sequence belongs to an in-set word has essentially no effect on the monophone search network. Thus when the input belongs to an out-of-set word, the monophone search network obtains a higher output-state score and transmission score than the in-set word search network; when the input belongs to an in-set word, the monophone search network obtains a lower output-state score and transmission score than the in-set word search network. In theory, only when the input speech belongs to an in-set word can the transmission score and output-state score of the in-set word search network approach or even exceed those of the monophone search network. When the output-state score of the in-set word search network meets the preset condition, for example when it exceeds a preset threshold, the second transmission score, from entering this in-set word search network to leaving it, is computed, and then the first transmission score of the monophone search network over the same stretch, entered at the same time, is computed. Further, when the in-set word output-state score exceeds the set threshold, the output-state score and the history propagation information of the output state are recorded; from the history propagation information, the activation information of the propagation path can be found, from which the start frame of the whole word and the transmission score can be derived. In addition, the first output-state score of the monophone search network at the current frame is computed, together with the transmission score of the monophone search network over the period during which the in-set word search network was active. A transmission score can be the score carried during token passing: it records the output-state score accumulated over a certain stretch of the complete decoding path, namely the stretch from entering the decoding model to leaving it, so the transmission score directly reflects the matching degree over that stretch of the propagation.
Step 2: obtain the confidence according to the first transmission score and the second transmission score.
After the transmission scores of the synchronous decoding of the same stretch of the feature sequence in the monophone search network and the in-set word search network are obtained, the first transmission score of the monophone search network serves as a reference score, and the confidence is derived by weighing the second transmission score of the in-set word search network against it. Further, the confidence can be defined as the ratio of the second transmission score to the first transmission score, which characterizes the matching credibility of each path of the input feature sequence.
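A minimal sketch of this ratio; the function name and the score values are illustrative assumptions:

```python
def confidence(first_score, second_score):
    """Ratio of the in-set word network's transmission score to the
    monophone network's reference transmission score."""
    return second_score / first_score
```

With log-probability transmission scores (negative numbers), for example a monophone reference of -40.0 and an in-set score of -20.0, the ratio is 0.5.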
Further, step S105 can also include:
Step 1: when the in-set word output-state score exceeds a second preset threshold, obtain the first transmission score and the second transmission score separately through the network topology.
The obtained in-set word output-state score is compared with the second preset threshold; when it exceeds the threshold, the first transmission score of the monophone search network and the second transmission score of the in-set word search network are obtained separately through the network topology. Further, before the feature sequence enters the word network to be recognized, which comprises the monophone search network and the in-set word search network, a token is generated that records traceback information and the transmission score. When the output state of the word network to be recognized is reached, its output-state score is computed, and subtracting the entry-point score traced back through the token from the output-state score yields the transmission score over the word network to be recognized. What the transmission score records is essentially the output-state score over a certain stretch of the full path, here the stretch from entering the word network to be recognized to leaving it. Further, unlike the in-set word search network, the monophone search network propagates only an initialization token and generates no new tokens, but it can use the token traceback information of the in-set word search network to obtain the transmission score over the identical propagation stretch. Specifically, the first transmission score of the monophone search network is obtained by subtracting, from the maximum output-state score of the monophone search network over the same period, the entry-point score traced back by the token in the in-set word search network, because the search of the monophone search network retains only a single optimal path. Through the network topology, the first transmission score of the monophone search network's synchronous decoding and the second transmission score of the in-set word search network's synchronous decoding over the input feature sequence can thus be obtained.
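The token bookkeeping described above can be sketched as follows; the class and field names are assumptions for illustration, not the patented data layout:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """Carries traceback information through the word network to be recognized."""
    entry_score: float  # accumulated score when the token entered the network
    score: float        # running output-state score along the path

    def transmission_score(self) -> float:
        # score accumulated strictly inside the word network:
        # output-state score minus the score at the entry point
        return self.score - self.entry_score
```

The first transmission score of the monophone network is formed the same way, using its own maximum output-state score together with the entry point traced back through the in-set network's token.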
Step 2: take the ratio of the second transmission score to the first transmission score as the confidence.
After the first and second transmission scores are obtained, the first transmission score of the monophone search network serves as the reference score, and the ratio of the second transmission score to the first transmission score is defined as the confidence, which characterizes the matching credibility of each path of the input feature sequence in the monophone search network and the in-set word search network. With this definition, a smaller confidence value indicates higher matching credibility; likewise, when the confidence is defined as the ratio of the first transmission score to the second transmission score, a larger value indicates higher matching credibility. Defining the confidence as the ratio of the transmission scores of the monophone search network and the in-set word search network, and then using that confidence to select the recognition result, rejects the interference of most out-of-set words very effectively and guarantees recognition accuracy.
Further, step S107 can include:
Step 1: obtain the number of frames of the speech signal feature sequence whose confidence meets the confidence threshold condition.
The traditional approach of directly comparing confidences across different words and deciding the recognition result by the best confidence is not very effective. In practice the speech signal is continuous, so when the feature sequence matches a search network, not just one frame but several frames will obtain a very high score. Taking this continuity into account, after the confidence is obtained, the number of frames of the feature sequence whose confidence meets the confidence threshold condition is counted. Specifically, each decoding path has its own number of frames meeting the condition, and this count is accumulated per path. Each in-set word search network is independent, with its own phonemes and phoneme count, so a separate confidence threshold is set for each in-set word. When the confidence is defined as the ratio of the in-set word search network's transmission score to the monophone search network's transmission score, a frame whose confidence exceeds the threshold is regarded as the pronunciation of an out-of-set word and rejected, while a frame below the threshold is retained for the integrated decision.
Step 2: output according to the decoding path corresponding to the maximum frame count to obtain the speech recognition result.
After the number of frames meeting the confidence threshold condition is obtained for each decoding path, recognition proceeds along the decoding path with the largest such count, and the speech recognition result is output. In this embodiment the computed confidence is directly tied to the number of phonemes an in-set word contains, so the best confidence differs from word to word, and directly comparing confidences across different words to decide the result is not very effective. Accounting for this per-word difference and for the continuity of the speech signal, the number of outputs meeting the confidence threshold is first obtained for each word through decoding statistics, and the word with the most qualifying outputs is finally taken as the recognition result by comparison. Selecting the result this way, by counting the output frames of each word that meet its confidence threshold and comparing the counts, effectively improves speech recognition accuracy.
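The per-word frame counting and final selection can be sketched as follows; the data layout and the threshold direction (frames below a word's threshold counting as qualifying, per the ratio definition above) are assumptions for illustration:

```python
def pick_result(frame_confidences, thresholds):
    """frame_confidences: {word: [confidence per decoded frame]}
    thresholds: per-word confidence thresholds (each word has its own,
    since words differ in phoneme count)."""
    counts = {
        word: sum(1 for c in confs if c < thresholds[word])
        for word, confs in frame_confidences.items()
    }
    # the word whose path has the most qualifying frames wins
    best = max(counts, key=counts.get)
    return best, counts
```

A word with only one or two barely-qualifying frames thus loses to a word whose path stays under its threshold for many consecutive frames, reflecting the continuity of genuine speech.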
Further, Fig. 3 is a schematic flowchart of the steps preceding step S101 in an embodiment. As shown in Fig. 3, before step S101 the method includes:
Step S101a: acquire a speech signal.
The speech signal can be acquired by a speech acquisition system. Specifically, speech can be collected by a pickup such as a microphone and then processed through an amplifier and a filter.
Step S101b: perform endpoint detection on the acquired speech signal to obtain the speech signal feature sequence.
The raw speech signal collected directly by the speech acquisition system contains much irrelevant information and background noise, so the raw signal sequence must go through a series of preprocessing steps in an endpoint detection module: endpoint detection (determining where the speech signal starts and ends), pre-filtering (removing the influence of individual pronunciation differences, background noise, and the like), framing (the speech signal is approximately short-term stationary within 10-30 ms, so it is split into segments for analysis), windowing (so it can be processed with the analysis methods of stationary processes), pre-emphasis (boosting the high-frequency part), and Fourier transformation (converting it into a digital representation that is easy to process). After this preprocessing, the speech signal feature sequence that is input into the decoding model for propagation and decoding is obtained.
Further, Fig. 4 is a schematic flowchart of step S101b in an embodiment. As shown in Fig. 4, the input speech signal is first processed by framing, windowing, and pre-emphasis, then undergoes a Fast Fourier Transform (FFT), and then passes through a triangular window filter bank; noise power estimation and smoothed power estimation are performed separately on the filtered output. In the noise power estimation branch, the signal-to-noise ratio (SNR) is computed and compared with a threshold; if it does not exceed the threshold, the flow returns to the speech signal acquisition step. In the smoothed power estimation branch, the difference between the actual power and the estimate is computed; when the difference is below a threshold, the flow likewise returns to the acquisition step. When the SNR from the noise power estimation exceeds its threshold and the difference between the actual power and the smoothed power estimate is not below its threshold, a discrete cosine transform (DCT) is applied to the signal, and the speech signal feature sequence is output.
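The pre-emphasis → framing/windowing → FFT → triangular filter bank → DCT chain can be sketched as below, omitting the noise-power and smoothed-power gating of Fig. 4; every parameter value (frame length, hop, filter and cepstral counts) is an assumption for illustration:

```python
import numpy as np

def features(signal, sr=16000, frame_len=400, hop=160, nfft=512,
             n_filt=26, n_ceps=13):
    # Pre-emphasis boosts the high-frequency part
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing (25 ms frames, 10 ms hop at 16 kHz) with a Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular filter bank on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    log_e = np.log(np.maximum(power @ fbank.T, 1e-10))
    # DCT-II decorrelates the log filter-bank energies into cepstra
    k = np.arange(n_ceps)[:, None] * (np.arange(n_filt) + 0.5)
    dct = np.cos(np.pi * k / n_filt)
    return log_e @ dct.T
```

One second of 16 kHz audio yields 98 frames of 13 coefficients each, a feature sequence of the kind fed frame by frame into the decoding networks.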
In addition, the present application also provides a speech recognition system. Fig. 5 is a schematic structural diagram of an embodiment of the speech recognition system of the present application. As shown in Fig. 5, a speech recognition system includes:
a synchronous decoding module 100, configured to input the speech signal feature sequence into the monophone search network and the in-set word search network respectively, and perform synchronous decoding;
a state score acquisition module 300, configured to obtain the in-set word output-state score obtained by the synchronous decoding;
a confidence acquisition module 500, configured to obtain the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output-state score meets the preset condition; and
a speech recognition output module 700, configured to output the speech recognition result according to the decoding path corresponding to the selected confidence.
In the above speech recognition system, the synchronous decoding module propagates the speech signal feature sequence through the monophone search network and the in-set word search network for synchronous decoding; when the in-set word output-state score obtained by the state score acquisition module from the in-set word search network's decoding meets the preset condition, the confidence acquisition module obtains the confidence of the synchronous decoding of the monophone search network and the in-set word search network; finally, the speech recognition output module outputs the speech recognition result according to the decoding path corresponding to that confidence. Feeding the feature sequence into the monophone search network and the in-set word search network for decoding at the same time effectively distinguishes in-set word recognition from out-of-set word rejection and guarantees recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
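The four modules of Fig. 5 could be wired together roughly as follows; all class and method names are assumptions, a structural sketch rather than the patented implementation:

```python
class SpeechRecognitionSystem:
    def __init__(self, decoder, scorer, confidence_calc, selector):
        self.decoder = decoder                   # synchronous decoding module 100
        self.scorer = scorer                     # state score acquisition module 300
        self.confidence_calc = confidence_calc   # confidence acquisition module 500
        self.selector = selector                 # speech recognition output module 700

    def recognize(self, feature_sequence):
        # decode the feature sequence through both networks synchronously
        paths = self.decoder.decode(feature_sequence)
        candidates = []
        for path in paths:
            score = self.scorer.output_state_score(path)
            if self.scorer.meets_condition(score):
                candidates.append((self.confidence_calc.confidence(path), path))
        # output along the decoding path selected by confidence
        return self.selector.select(candidates) if candidates else None
```

The value of this decomposition is that each module (decoding, scoring, confidence, selection) can be replaced independently, which mirrors the module numbering of Fig. 5.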
Further, a computer device is also provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the speech recognition method of any one of the embodiments described above.
When the processor of this computer device executes the program, it implements the speech recognition method of any one of the embodiments described above, so that the speech signal feature sequence is propagated through the monophone search network and the in-set word search network for synchronous decoding; when the in-set word output-state score obtained by the in-set word search network's decoding meets the preset condition, the confidence of the synchronous decoding of the monophone search network and the in-set word search network is obtained; finally, the speech recognition result is output according to the decoding path corresponding to that confidence. Feeding the feature sequence into both networks for decoding at the same time effectively distinguishes in-set word recognition from out-of-set word rejection and guarantees recognition accuracy; selecting the decoding path according to the confidence further improves recognition accuracy.
In addition, those of ordinary skill in the art will appreciate that all or part of the flow of the methods in the above embodiments can be completed by instructing the relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium. In the embodiments of the present application, the program can be stored in the storage medium of a computer system and executed by at least one processor in the computer system to realize the flow of the embodiments of each speech recognition method described above.
Further, a storage medium is also provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method of any one of the embodiments described above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A speech recognition method, characterized by comprising:
    inputting a speech signal feature sequence into a monophone search network and an in-set word search network respectively, and performing synchronous decoding;
    obtaining an in-set word output-state score obtained by the synchronous decoding;
    when the in-set word output-state score meets a preset condition, obtaining a confidence of the synchronous decoding of the monophone search network and the in-set word search network; and
    outputting a speech recognition result according to a decoding path corresponding to the selected confidence.
  2. The speech recognition method according to claim 1, characterized in that the step of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and performing synchronous decoding comprises:
    inputting a current-frame speech signal feature sequence into the monophone search network to obtain a first output-state score; and
    when the first output-state score exceeds a first preset threshold, inputting a next-frame speech signal feature sequence into the monophone search network and the in-set word search network respectively for synchronous decoding.
  3. The speech recognition method according to claim 2, characterized in that the step of inputting the current-frame speech signal feature sequence into the monophone search network to obtain the first output-state score comprises:
    inputting the current-frame speech signal feature sequence into the monophone search network;
    obtaining a joint probability of the current-frame speech signal feature sequence and the primitives of the monophone search network; and
    taking the maximum of the joint probabilities as the first output-state score.
  4. The speech recognition method according to claim 1, characterized in that the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output-state score meets the preset condition comprises:
    when the in-set word output-state score meets the preset condition, obtaining a first transmission score of the synchronous decoding of the monophone search network and a second transmission score of the synchronous decoding of the in-set word search network; and
    obtaining the confidence according to the first transmission score and the second transmission score.
  5. The speech recognition method according to claim 4, characterized in that the step of obtaining the confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output-state score meets the preset condition comprises:
    when the in-set word output-state score exceeds a second preset threshold, obtaining the first transmission score and the second transmission score separately through a network topology; and
    taking the ratio of the second transmission score to the first transmission score as the confidence.
  6. The speech recognition method according to claim 1, characterized in that the step of outputting the speech recognition result according to the decoding path corresponding to the selected confidence comprises:
    obtaining the number of frames of the speech signal feature sequence whose confidence meets a confidence threshold condition; and
    outputting according to the decoding path corresponding to the maximum frame count to obtain the speech recognition result.
  7. The speech recognition method according to claim 1, characterized in that before the step of inputting the speech signal feature sequence into the monophone search network and the in-set word search network respectively and performing synchronous decoding, the method comprises:
    acquiring a speech signal; and
    performing endpoint detection on the acquired speech signal to obtain the speech signal feature sequence.
  8. A speech recognition system, characterized by comprising:
    a synchronous decoding module, configured to input a speech signal feature sequence into a monophone search network and an in-set word search network respectively, and perform synchronous decoding;
    a state score acquisition module, configured to obtain an in-set word output-state score obtained by the synchronous decoding;
    a confidence acquisition module, configured to obtain a confidence of the synchronous decoding of the monophone search network and the in-set word search network when the in-set word output-state score meets a preset condition; and
    a speech recognition output module, configured to output a speech recognition result according to a decoding path corresponding to the selected confidence.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the speech recognition method according to any one of claims 1 to 7.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.
CN201711031665.9A 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium Active CN107871499B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711031665.9A CN107871499B (en) 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711031665.9A CN107871499B (en) 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN107871499A true CN107871499A (en) 2018-04-03
CN107871499B CN107871499B (en) 2020-06-16

Family

ID=61753362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711031665.9A Active CN107871499B (en) 2017-10-27 2017-10-27 Speech recognition method, system, computer device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN107871499B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109273007A (en) * 2018-10-11 2019-01-25 科大讯飞股份有限公司 Voice awakening method and device
CN110111775A (en) * 2019-05-17 2019-08-09 腾讯科技(深圳)有限公司 A kind of Streaming voice recognition methods, device, equipment and storage medium
WO2019214361A1 (en) * 2018-05-08 2019-11-14 腾讯科技(深圳)有限公司 Method for detecting key term in speech signal, device, terminal, and storage medium
CN111862943A (en) * 2019-04-30 2020-10-30 北京地平线机器人技术研发有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN112331219A (en) * 2020-11-05 2021-02-05 北京爱数智慧科技有限公司 Voice processing method and device
CN112652306A (en) * 2020-12-29 2021-04-13 珠海市杰理科技股份有限公司 Voice wake-up method and device, computer equipment and storage medium
CN114783438A (en) * 2022-06-17 2022-07-22 深圳市友杰智新科技有限公司 Adaptive decoding method, apparatus, computer device and storage medium
CN115831100A (en) * 2023-02-22 2023-03-21 深圳市友杰智新科技有限公司 Voice command word recognition method, device, equipment and storage medium

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
KR100679051B1 (en) * 2005-12-14 2007-02-05 삼성전자주식회사 Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms
CN101030369B (en) * 2007-03-30 2011-06-29 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101604520A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 Spoken language voice recognition method based on statistical model and syntax rule
CN101763855B (en) * 2009-11-20 2012-01-04 安徽科大讯飞信息科技股份有限公司 Method and device for judging confidence of speech recognition
CN105321518B (en) * 2014-08-05 2018-12-04 中国科学院声学研究所 A kind of rejection method for identifying of low-resource Embedded Speech Recognition System
CN105161096B (en) * 2015-09-22 2017-05-10 百度在线网络技术(北京)有限公司 Speech recognition processing method and device based on garbage models
CN106683677B (en) * 2015-11-06 2021-11-12 阿里巴巴集团控股有限公司 Voice recognition method and device
CN106782513B (en) * 2017-01-25 2019-08-23 上海交通大学 Speech recognition realization method and system based on confidence level

Cited By (14)

Publication number Priority date Publication date Assignee Title
US11341957B2 (en) 2018-05-08 2022-05-24 Tencent Technology (Shenzhen) Company Limited Method for detecting keyword in speech signal, terminal, and storage medium
WO2019214361A1 (en) * 2018-05-08 2019-11-14 Tencent Technology (Shenzhen) Company Limited Method for detecting key term in speech signal, device, terminal, and storage medium
CN109273007B (en) * 2018-10-11 2022-05-17 Xi'an iFLYTEK Super Brain Information Technology Co., Ltd. Voice wake-up method and device
CN109273007A (en) * 2018-10-11 2019-01-25 iFLYTEK Co., Ltd. Voice wake-up method and device
CN111862943A (en) * 2019-04-30 2020-10-30 Beijing Horizon Robotics Technology Research and Development Co., Ltd. Speech recognition method and apparatus, electronic device, and storage medium
CN111862943B (en) * 2019-04-30 2023-07-25 Beijing Horizon Robotics Technology Research and Development Co., Ltd. Speech recognition method and device, electronic equipment and storage medium
CN110111775A (en) * 2019-05-17 2019-08-09 Tencent Technology (Shenzhen) Company Limited Streaming speech recognition method, device, equipment and storage medium
CN112331219A (en) * 2020-11-05 2021-02-05 Beijing Aishu Zhihui Technology Co., Ltd. Voice processing method and device
CN112331219B (en) * 2020-11-05 2024-05-03 Beijing Qingshu Zhihui Technology Co., Ltd. Voice processing method and device
CN112652306A (en) * 2020-12-29 2021-04-13 Zhuhai Jieli Technology Co., Ltd. Voice wake-up method and device, computer equipment and storage medium
CN112652306B (en) * 2020-12-29 2023-10-03 Zhuhai Jieli Technology Co., Ltd. Voice wake-up method, voice wake-up device, computer equipment and storage medium
CN114783438A (en) * 2022-06-17 2022-07-22 Shenzhen Youjie Zhixin Technology Co., Ltd. Adaptive decoding method, apparatus, computer device and storage medium
CN114783438B (en) * 2022-06-17 2022-09-27 Shenzhen Youjie Zhixin Technology Co., Ltd. Adaptive decoding method, apparatus, computer device and storage medium
CN115831100A (en) * 2023-02-22 2023-03-21 Shenzhen Youjie Zhixin Technology Co., Ltd. Voice command word recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN107871499B (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN107871499A (en) Audio recognition method, system, computer equipment and computer-readable recording medium
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
JP4195428B2 (en) Speech recognition using multiple speech features
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN102723078B (en) Emotion speech recognition method based on natural language comprehension
CN107369439B (en) Voice awakening method and device
CN105206258A (en) Acoustic model generation method and device, and speech synthesis method and device
CN105895078A (en) Speech recognition method and device for dynamically selecting speech model
CN107731233A (en) Voiceprint recognition method based on RNN
Brandes Feature vector selection and use with hidden Markov models to identify frequency-modulated bioacoustic signals amidst noise
CN111105785B (en) Text prosody boundary recognition method and device
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN112735477B (en) Voice emotion analysis method and device
CN116580706B (en) Speech recognition method based on artificial intelligence
CN109065073A (en) Speech emotion recognition method based on deep SVM network model
CN111341319A (en) Audio scene recognition method and system based on local texture features
CN111883181A (en) Audio detection method and device, storage medium and electronic device
Kharamat et al. Durian ripeness classification from the knocking sounds using convolutional neural network
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN108364655A (en) Speech processing method, medium, device and computing device
Ling An acoustic model for English speech recognition based on deep learning
CN118136022A (en) Intelligent voice recognition system and method
CN102141812A (en) Robot
CN111402887A (en) Method and device for escaping characters by voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province

Patentee after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province

Patentee before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.