CN108428448A - Voice endpoint detection method and speech recognition method - Google Patents

Voice endpoint detection method and speech recognition method

Info

Publication number
CN108428448A
Authority
CN
China
Prior art keywords
voice data
frame
label
voice
mute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710076757.2A
Other languages
Chinese (zh)
Inventor
范利春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd filed Critical Yutou Technology Hangzhou Co Ltd
Priority to CN201710076757.2A priority Critical patent/CN108428448A/en
Priority to PCT/CN2018/074311 priority patent/WO2018145584A1/en
Priority to TW107104564A priority patent/TWI659409B/en
Publication of CN108428448A publication Critical patent/CN108428448A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/05 - Word boundary detection
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice endpoint detection method and a speech recognition method, belonging to the technical field of speech recognition. The method includes: extracting the speech features of the voice data and inputting them into a silence model; the silence model outputting, according to the speech features, a label indicating whether the voice data is a silent frame; and confirming the voice endpoints of a speech segment according to the labels of consecutive frames of voice data: in the inactive state, if the length of a run of consecutive non-silent frames of voice data exceeds a preset first threshold, the first frame of that run of non-silent voice data is judged to be the starting endpoint of the speech segment; in the active state, if the length of a run of consecutive silent frames of voice data exceeds a preset second threshold, the first frame of that run of silent voice data is judged to be the ending endpoint of the speech segment. The beneficial effect of this technical solution is that it solves the prior-art problems of inaccurate voice endpoint detection and excessive requirements on the detection environment.

Description

Voice endpoint detection method and speech recognition method
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a voice endpoint detection method and a speech recognition method.
Background technology
With the development of speech recognition technology, speech recognition is increasingly widely applied in daily life. When a user applies speech recognition on a handheld device, a push-to-talk button is usually used to control the start and end times of the speech passage to be recognized. When the user applies speech recognition in a smart-home environment, however, the starting endpoint and ending endpoint of the speech passage cannot be determined manually by means of a button, because the user is relatively far from the pickup device. In this case another way of automatically judging the start and end times of the speech is needed, namely voice activity detection (Voice Activity Detection, VAD).
Traditional endpoint detection methods are mainly based on sub-band energy: the energy of each frame of voice data within a certain frequency band is computed and compared with a preset energy threshold to judge the starting endpoint and ending endpoint of the speech. Such endpoint detection methods place high demands on the detection environment; the accuracy of the detected voice endpoints can only be guaranteed if speech recognition is carried out in a quiet environment. In noisier environments, different types of noise affect the sub-band energy in different ways; in strongly interfering, low-SNR, and non-stationary noise environments in particular, the computation of the sub-band energy is heavily disturbed, so that the final detection result is inaccurate. Yet only when voice endpoint detection is accurate can the speech be collected correctly and, in turn, recognized correctly. An inaccurate endpoint detection result may truncate the speech or record excessive noise, so that whole utterances cannot be decoded, causing missed or false detections, or even causing whole utterances to be decoded entirely wrongly, reducing the accuracy of the speech recognition result.
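For context only, the prior-art sub-band energy approach described above can be sketched in a few lines of Python; the 300-3400 Hz band, 25 ms window and 10 ms shift at 16 kHz, and the fixed energy threshold are assumptions chosen for illustration, not values taken from the patent:

```python
import numpy as np

def subband_energy_vad(y, sr=16000, band=(300, 3400),
                       frame_len=400, hop=160, threshold=1e-3):
    """Traditional VAD sketch: threshold the per-frame sub-band energy."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    flags = []
    for start in range(0, len(y) - frame_len + 1, hop):
        spectrum = np.fft.rfft(y[start:start + frame_len] * window)
        energy = np.mean(np.abs(spectrum[in_band]) ** 2)
        flags.append(energy > threshold)  # True = frame judged speech-like
    return flags
```

As the passage above notes, the fixed threshold is exactly what breaks down in low-SNR and non-stationary noise.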
Summary of the invention
In view of the above problems in the prior art, a technical solution comprising a voice endpoint detection method and a speech recognition method is now provided, intended to solve the prior-art problems that voice endpoint detection is inaccurate and places excessive requirements on the detection environment. The technical solution specifically includes:
A voice endpoint detection method, wherein a silence model for judging whether voice data is a silent frame is trained in advance; a speech segment consisting of externally input voice data comprising consecutive frames is then acquired, and the following steps are executed:
Step S1: extracting the speech features of each frame of the voice data, and inputting the speech features into the silence model;
Step S2: the silence model outputting, according to the speech features, a label associated with each frame of the voice data, the label indicating whether the voice data is a silent frame;
Step S3: confirming the voice endpoints of the speech segment according to the labels of consecutive frames of the voice data:
when the pickup device collecting the speech is in an inactive state, if the length of a run of consecutive non-silent frames of voice data exceeds a preset first threshold, judging the first frame of that run of non-silent voice data to be the starting endpoint of the speech segment;
when the pickup device collecting the speech is in an active state, if the length of a run of consecutive silent frames of voice data exceeds a preset second threshold, judging the first frame of that run of silent voice data to be the ending endpoint of the speech segment.
Preferably, in the voice endpoint detection method, the silence model is trained in advance by the following method:
Step A1: inputting a plurality of preset training voice data, and extracting the speech features of each training voice data;
Step A2: performing an automatic labeling operation on each frame of the training voice data according to the corresponding speech features, to obtain a label corresponding to each frame of voice data, the label indicating whether the corresponding frame of voice data is a silent frame or a non-silent frame;
Step A3: training the silence model from the training voice data and the corresponding labels;
a first node and a second node being provided on the output layer of the silence model;
the first node indicating the label corresponding to the silent frame;
the second node indicating the label corresponding to the non-silent frame.
Preferably, in the voice endpoint detection method, an annotation text is preset for each externally input training voice data, marking the text content corresponding to that training voice data;
step A2 then specifically includes:
Step A21: obtaining the speech features and the corresponding annotation text;
Step A22: force-aligning the speech features with the corresponding annotation text using an acoustic model trained in advance, to obtain the output label of the phone corresponding to each frame of speech features;
Step A23: post-processing the force-aligned training voice data, mapping the output labels of silent phones onto the label indicating the silent frame, and mapping the output labels of non-silent phones onto the label indicating the non-silent frame.
Preferably, in the voice endpoint detection method, the acoustic model trained in advance in step A22 is a Gaussian mixture model-hidden Markov model, or a deep neural network-hidden Markov model.
Preferably, in the voice endpoint detection method, the silence model is a deep neural network model comprising multiple layers of neural networks.
Preferably, in the voice endpoint detection method, at least one nonlinear transformation is included between every two layers of neural networks of the silence model.
Preferably, in the voice endpoint detection method, each layer of neural network of the silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.
Preferably, in the voice endpoint detection method, the silence model is a deep neural network model comprising multiple layers of neural networks;
a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to the silent frame;
the second node indicates the label corresponding to the non-silent frame;
step S2 then specifically includes:
Step S21: after the speech features are input into the silence model, obtaining, through the forward computation of the multiple layers of neural networks, a first value associated with the first node of the output layer and a second value associated with the second node;
Step S22: comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the voice data and outputting it;
if the first value is less than the second value, taking the second node as the label of the voice data and outputting it.
A speech recognition method, wherein the starting endpoint and the ending endpoint of a speech segment to be recognized are detected using the above voice endpoint detection method.
The beneficial effect of the above technical solution is that it provides a voice endpoint detection method that solves the prior-art problems of inaccurate voice endpoint detection and excessive requirements on the detection environment, thereby improving the accuracy of voice endpoint detection and broadening the applicability of endpoint detection, which benefits the entire speech recognition process.
Description of the drawings
Fig. 1 is an overall flow diagram of a voice endpoint detection method in a preferred embodiment of the present invention;
Fig. 2 is a flow diagram of training the silence model in a preferred embodiment of the present invention;
Fig. 3 is, on the basis of Fig. 2, a flow diagram of automatically labeling the training voice data in a preferred embodiment of the present invention;
Fig. 4 is a structural diagram of a silence model comprising multiple layers of neural networks in a preferred embodiment of the present invention;
Fig. 5 is, on the basis of Fig. 1, a flow diagram of processing and outputting the label associated with the voice data in a preferred embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The present invention is further described below with reference to the drawings and specific embodiments, which are not intended to limit the invention.
In view of the above problems in the prior art, a voice endpoint detection method is now provided. In this method, a silence model for judging whether voice data is a silent frame is trained in advance; a speech segment consisting of externally input voice data comprising consecutive frames is then acquired, and the following steps, shown in Fig. 1, are executed:
Step S1: extracting the speech features of each frame of voice data, and inputting the speech features into the silence model;
Step S2: the silence model outputting, according to the speech features, a label associated with each frame of voice data, the label indicating whether the voice data is a silent frame;
Step S3: confirming the voice endpoints of the speech segment according to the labels of consecutive frames of voice data:
when the pickup device collecting the speech is in an inactive state, if the length of a run of consecutive non-silent frames exceeds a preset first threshold, judging the first frame of that run of non-silent voice data to be the starting endpoint of the speech segment;
when the pickup device collecting the speech is in an active state, if the length of a run of consecutive silent frames exceeds a preset second threshold, judging the first frame of that run of silent voice data to be the ending endpoint of the speech segment.
Specifically, in this embodiment, a silence model is first formed, which can be used to judge whether each frame of voice data in a speech segment is a silent frame. A so-called silent frame is voice data that contains no valid speech requiring recognition; a so-called non-silent frame is voice data that does contain valid speech requiring recognition.
Then, in this embodiment, after the silence model has been trained, the speech features of each frame of voice data in an externally input speech segment are extracted and input into the silence model, which outputs the associated label. In this embodiment there are two labels in total, indicating respectively that a frame of voice data is a silent frame or a non-silent frame.
In this embodiment, after the silent/non-silent classification of each frame of voice data has been obtained, the voice endpoints are judged. The appearance of a single non-silent frame does not, however, mean that a speech segment has started, nor does a single silent frame mean that the speech segment has ended; the starting endpoint and ending endpoint of the speech segment must instead be judged from the number of consecutive silent/non-silent frames. Specifically:
when the pickup device collecting the speech is in the inactive state, if the length of a run of consecutive non-silent frames exceeds a preset first threshold, the first frame of that run of non-silent voice data is judged to be the starting endpoint of the speech segment;
when the pickup device collecting the speech is in the active state, if the length of a run of consecutive silent frames exceeds a preset second threshold, the first frame of that run of silent voice data is judged to be the ending endpoint of the speech segment.
In a preferred embodiment of the present invention, the first threshold may take the value 30 and the second threshold the value 50. That is:
when the pickup device collecting the speech is in the inactive state, if the length of a run of consecutive non-silent frames exceeds 30 (i.e., 30 consecutive non-silent frames appear), the first non-silent frame of the run is judged to be the starting endpoint of the speech segment;
when the pickup device collecting the speech is in the active state, if the length of a run of consecutive silent frames exceeds 50 (i.e., 50 consecutive silent frames appear), the first silent frame of the run is judged to be the ending endpoint of the speech segment.
In another preferred embodiment of the present invention, the first threshold may instead take the value 70, while the second threshold again takes the value 50.
In other embodiments of the present invention, the values of the first threshold and the second threshold can be set freely according to the actual situation, to meet the needs of voice endpoint detection in different environments.
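Purely as an illustration (the function and variable names below are hypothetical, and the 30/50 defaults follow the example values above), the decision logic of step S3 amounts to a small state machine over the per-frame labels:

```python
def detect_endpoints(labels, first_threshold=30, second_threshold=50):
    """Sketch of the step-S3 endpoint decision.

    labels: per-frame booleans, True = non-silent frame.
    Returns (start_frame, end_frame); either may be None if undetected.
    """
    active = False               # pickup device state
    run_start, run_len = 0, 0    # current run of awaited frames
    start_frame = end_frame = None
    for i, non_silent in enumerate(labels):
        # While inactive, wait for non-silent frames;
        # while active, wait for silent frames.
        awaited = non_silent if not active else not non_silent
        if awaited:
            if run_len == 0:
                run_start = i
            run_len += 1
        else:
            run_len = 0
        if not active and run_len > first_threshold:
            start_frame = run_start  # first frame of the non-silent run
            active, run_len = True, 0
        elif active and run_len > second_threshold:
            end_frame = run_start    # first frame of the silent run
            break
    return start_frame, end_frame
```

For example, 10 silent frames followed by 40 non-silent frames and then 60 silent frames yield a starting endpoint at frame 10 and an ending endpoint at frame 50.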
In a preferred embodiment of the present invention, the silence model can be trained in advance by the following method, shown in Fig. 2:
Step A1: inputting a plurality of preset training voice data, and extracting the speech features of each training voice data;
Step A2: performing an automatic labeling operation on each frame of the training voice data according to the corresponding speech features, to obtain a label corresponding to each frame of voice data, the label indicating whether the corresponding frame of voice data is a silent frame or a non-silent frame;
Step A3: training the silence model from the training voice data and the corresponding labels;
a first node and a second node are provided on the output layer of the silence model;
the first node indicates the label corresponding to a silent frame;
the second node indicates the label corresponding to a non-silent frame.
Specifically, in this embodiment, a plurality of preset training voice data are first input. So-called training voice data are voice data whose text content is known in advance. The training voice data can be extracted from the Chinese voice data set of a speech recognition system that has been trained in advance with annotated texts, so the annotation text corresponding to each training voice data is available. In other words, the training voice data input in step A1 are the same voice data used in training the acoustic model of the subsequent speech recognition.
In this embodiment, after the training voice data have been input, the speech features of each training voice data are extracted. The feature extraction can use the same speech features extracted when training the acoustic model of the speech recognition system. Common speech features include Mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC), perceptual linear prediction (Perceptual Linear Prediction, PLP), and filter-bank (Filter-Bank, FBANK) features. Similarly, in other embodiments of the present invention, other similar speech features may be used to complete the training of the silence model.
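As an illustrative sketch only (the patent prescribes no toolkit; the 16 kHz sampling rate, 25 ms window, 10 ms shift, and 40 mel bands are common assumptions rather than values from the patent), FBANK features of the kind listed above can be computed with the librosa library:

```python
import numpy as np
import librosa

def fbank_features(wav_path, n_mels=40, frame_ms=25, hop_ms=10):
    """Log mel filter-bank (FBANK) features, one row per frame."""
    y, sr = librosa.load(wav_path, sr=16000)   # resample to 16 kHz
    n_fft = int(sr * frame_ms / 1000)          # 25 ms analysis window
    hop_length = int(sr * hop_ms / 1000)       # 10 ms frame shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-10).T               # shape: (num_frames, n_mels)
```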
In this embodiment, in step A2, before the training voice data are used as training input to the silence model, an automatic labeling operation must first be performed on them so that every frame of speech data is frame-aligned. In the automatic labeling operation, each frame of voice data obtains a label. The automatic labeling procedure is described in detail below; once the automatic labeling operation is done, the silence model can be trained.
In a preferred embodiment of the present invention, an annotation text is preset for each externally input training voice data, marking the text content corresponding to that training voice data.
Step A2 is then as shown in Fig. 3 and may include:
Step A21: obtaining the speech features and the corresponding annotation text;
Step A22: force-aligning the speech features with the corresponding annotation text using an acoustic model trained in advance, to obtain the output label of the phone corresponding to each frame of speech features;
Step A23: post-processing the force-aligned training voice data, mapping the output labels of silent phones onto the label indicating a silent frame, and mapping the output labels of non-silent phones onto the label indicating a non-silent frame.
Specifically, in this embodiment, manually labeling the training voice data would consume a great deal of labor cost, and the labeling of noise would also be inconsistent across different annotators, affecting the subsequent model training. The technical solution of the present invention therefore provides an efficient and feasible automatic labeling method.
In this method, the speech features of each frame of training voice data and the corresponding annotation text are first obtained, and the speech features and the annotation text are then force-aligned.
In this embodiment, the acoustic model of the subsequent speech recognition (i.e., the acoustic model trained in advance) can be used to force-align the speech features with the annotation text. The acoustic model of the speech recognition in the present invention may be a Gaussian mixture model-hidden Markov model (Gaussian Mixture Model-Hidden Markov Model, GMM-HMM), a deep neural network-hidden Markov model (Deep Neural Network-Hidden Markov Model, DNN-HMM), or another suitable model. The modeling unit of the acoustic model is at the phone level, such as context-independent phones (Context Independent Phone, ci-phone) or context-dependent phones (Context Dependent Phone, cd-phone). Performing the forced-alignment operation with this acoustic model frame-aligns the training voice data to the phone level.
In this embodiment, in step A23, after the force-aligned training voice data have been post-processed, voice data in which each frame corresponds to a silence label are obtained. In the post-processing operation, some phones are usually regarded as silent phones and the other phones as non-silent phones; after the mapping, each frame of voice data is mapped to a silent/non-silent label.
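A hedged sketch of this step-A23 post-processing follows; the silent-phone inventory ("sil", "sp", "spn") is a hypothetical example, since the actual phone set depends on the lexicon of the acoustic model used:

```python
# Hypothetical silent-phone set; the real inventory depends on the lexicon.
SILENT_PHONES = {"sil", "sp", "spn"}

def phones_to_frame_labels(aligned_phones):
    """Map the per-frame phone alignment to binary silence labels.

    aligned_phones: one phone label per frame, from forced alignment.
    Returns True for non-silent frames, False for silent frames.
    """
    return [phone not in SILENT_PHONES for phone in aligned_phones]

# e.g. phones_to_frame_labels(["sil", "sil", "b", "a", "sil"])
# -> [False, False, True, True, False]
```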
In a preferred embodiment of the present invention, the silence model can then be trained from the speech features and the frame-aligned labels obtained above. The silence model may be a deep neural network model comprising multiple layers of neural networks. Each layer of the silence model may be a fully connected neural network, a convolutional neural network, a recurrent neural network, etc., and one or more nonlinear transformations may be included between every two layers of neural networks, such as sigmoid, tanh, max-pooling, ReLU, or softmax nonlinear transformations.
In a preferred embodiment of the present invention, as shown in Fig. 4, the silence model includes multiple layers of neural networks 41 and an output layer 42. A first node 421 and a second node 422 are provided on the output layer 42 of the silence model. The first node 421 indicates the label corresponding to a silent frame, and the second node 422 indicates the label corresponding to a non-silent frame. A softmax or other nonlinear transformation may be applied on the first node 421 and the second node 422 of the output layer 42, or no nonlinear transformation may be applied.
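As a minimal sketch under assumptions (the layer sizes and the choice of PyTorch are illustrative; the patent prescribes neither), a silence model of the shape shown in Fig. 4 could be written as:

```python
import torch
import torch.nn as nn

class SilenceModel(nn.Module):
    """DNN silence model: stacked hidden layers, two output nodes.

    Output node 0 (the "first node") stands for the silent-frame label,
    node 1 (the "second node") for the non-silent-frame label.
    """
    def __init__(self, feat_dim=40, hidden_dim=256, num_hidden=3):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 2))  # output layer with two nodes
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, feat_dim)
        return self.net(x)                   # raw scores; softmax is optional
```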
In a preferred embodiment of the present invention, step S2 is then as shown in Fig. 5 and specifically includes:
Step S21: after the speech features are input into the silence model, obtaining, through the forward computation of the multiple layers of neural networks, a first value associated with the first node of the output layer and a second value associated with the second node;
Step S22: comparing the first value with the second value:
if the first value is greater than the second value, taking the first node as the label of the voice data and outputting it;
if the first value is less than the second value, taking the second node as the label of the voice data and outputting it.
Specifically, in this embodiment, the speech features are input into the trained silence model, the multiple layers of neural networks perform the forward computation, and the values of the two output nodes of the output layer (the first node and the second node), i.e. the first value and the second value, are finally obtained. The sizes of the first value and the second value are then compared:
if the first value is larger, the first node is selected as the label of the voice data and output, i.e. the voice data is a silent frame;
correspondingly, if the second value is larger, the second node is selected as the label of the voice data and output, i.e. the voice data is a non-silent frame.
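Continuing the hypothetical PyTorch sketch above, steps S21-S22 reduce to a forward pass followed by a comparison of the two output values:

```python
import torch

@torch.no_grad()
def label_frames(model, features):
    """features: (num_frames, feat_dim) float tensor of speech features.

    Returns per-frame booleans, True = non-silent (second node wins).
    """
    model.eval()
    scores = model(features)      # forward computation through all layers
    first_value = scores[:, 0]    # first node: silent-frame score
    second_value = scores[:, 1]   # second node: non-silent-frame score
    return (second_value > first_value).tolist()
```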
In a preferred embodiment of the present invention, a complete flow of the above voice endpoint detection method is described below:
First, a pretrained Chinese speech recognition system is prepared; the speech recognition system selected here has a Chinese voice data set and possesses the annotation texts of the voice data.
The training speech features used by the acoustic model of the speech recognition system are FBANK features, so FBANK features are also used when training the silence model.
Speech features are extracted from the training voice data and, together with the corresponding annotation texts, input into the speech recognition system for forced alignment, so that each frame of speech features corresponds to a phone-level label. The non-silent phones in the alignment result are then mapped onto the non-silent label and the silent phones onto the silent label, completing the preparation of the training data labels for the silence model.
Then, the silence model is trained from the above training voice data and the corresponding labels.
When voice endpoint detection is performed using the trained silence model, the speech features of each frame of voice data in a speech segment are extracted and fed into the trained silence model. After the forward computation of the multiple layers of neural networks, the first value of the first node and the second value of the second node are output; the two values are then compared, and the label of the node with the larger value is output as the label of that frame of voice data, indicating whether the frame is a silent or a non-silent frame.
Finally, whether consecutive silent/non-silent frames appear is judged:
when the pickup device collecting the speech is in the inactive state, if 30 consecutive non-silent frames appear, the first frame of voice data among those 30 consecutive non-silent frames is taken as the starting endpoint of the whole speech segment to be recognized;
when the pickup device collecting the speech is in the active state, if 50 consecutive silent frames appear, the first frame of voice data among those 50 consecutive silent frames is taken as the ending endpoint of the whole speech segment to be recognized.
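Tying the hypothetical sketches above together, the whole flow of this embodiment would read roughly as follows (the file name is a placeholder, and in practice the model weights would come from the training described earlier):

```python
feats = torch.tensor(fbank_features("utterance.wav"), dtype=torch.float32)
model = SilenceModel(feat_dim=feats.shape[1])  # load trained weights in practice
labels = label_frames(model, feats)            # per-frame silent/non-silent
start, end = detect_endpoints(labels, first_threshold=30, second_threshold=50)
print(f"speech segment to recognize: frames {start}..{end}")
```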
In a preferred embodiment of the present invention, a speech recognition method is also provided, in which the above voice endpoint detection method is used to detect the starting endpoint and the ending endpoint of a speech segment to be recognized, so as to determine the extent of the speech to be recognized; the speech segment is then recognized using existing speech recognition technology.
The foregoing are merely preferred embodiments of the present invention and do not limit the embodiments or the protection scope of the present invention. Those skilled in the art should appreciate that all schemes obtained by equivalent substitution or obvious variation on the basis of the description and drawings of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A voice endpoint detection method, characterized in that a silence model for judging whether voice data is a silent frame is trained in advance; a speech segment consisting of externally input voice data comprising consecutive frames is then acquired, and the following steps are executed:
Step S1: extracting the speech features of each frame of said voice data, and inputting said speech features into said silence model;
Step S2: said silence model outputting, according to said speech features, a label associated with each frame of said voice data, said label indicating whether said voice data is a silent frame;
Step S3: confirming the voice endpoints of said speech segment according to the labels of consecutive frames of said voice data:
when the pickup device collecting said speech is in an inactive state, if the length of a run of consecutive non-silent frames of said voice data exceeds a preset first threshold, judging the first frame of said run of non-silent voice data to be the starting endpoint of said speech segment;
when the pickup device collecting said speech is in an active state, if the length of a run of consecutive silent frames of said voice data exceeds a preset second threshold, judging the first frame of said run of silent voice data to be the ending endpoint of said speech segment.
2. The voice endpoint detection method of claim 1, characterized in that said silence model is trained in advance by the following method:
Step A1: inputting a plurality of preset training voice data, and extracting the speech features of each said training voice data;
Step A2: performing an automatic labeling operation on each frame of said training voice data according to the corresponding said speech features, to obtain a label corresponding to each frame of said voice data, said label indicating whether the corresponding frame of said voice data is a silent frame or a non-silent frame;
Step A3: training said silence model from said training voice data and the corresponding said labels;
a first node and a second node being provided on the output layer of said silence model;
said first node indicating the label corresponding to said silent frame;
said second node indicating the label corresponding to said non-silent frame.
3. The voice endpoint detection method of claim 2, characterized in that an annotation text is preset for each externally input said training voice data, marking the text content corresponding to said training voice data;
said step A2 then specifically includes:
Step A21: obtaining said speech features and the corresponding said annotation text;
Step A22: force-aligning said speech features with the corresponding said annotation text using an acoustic model trained in advance, to obtain the output label of the phone corresponding to each frame of said speech features;
Step A23: post-processing said force-aligned training voice data, mapping said output labels of silent phones onto the label indicating said silent frame, and mapping said output labels of non-silent phones onto the label indicating said non-silent frame.
4. The voice endpoint detection method of claim 3, characterized in that, in said step A22, said acoustic model trained in advance is a Gaussian mixture model-hidden Markov model, or a deep neural network-hidden Markov model.
5. The voice endpoint detection method of claim 1, characterized in that said silence model is a deep neural network model comprising multiple layers of neural networks.
6. The voice endpoint detection method of claim 5, characterized in that at least one nonlinear transformation is included between every two layers of said neural networks of said silence model.
7. The voice endpoint detection method of claim 5, characterized in that each layer of said neural networks of said silence model is a fully connected neural network, a convolutional neural network, or a recurrent neural network.
8. The voice endpoint detection method of claim 2, characterized in that said silence model is a deep neural network model comprising multiple layers of neural networks;
a first node and a second node are provided on the output layer of said silence model;
said first node indicates the label corresponding to said silent frame;
said second node indicates the label corresponding to a non-silent frame;
said step S2 then specifically includes:
Step S21: after said speech features are input into said silence model, obtaining, through the forward computation of said multiple layers of neural networks, a first value associated with said first node of said output layer and a second value associated with said second node;
Step S22: comparing said first value with said second value:
if said first value is greater than said second value, taking said first node as said label of said voice data and outputting it;
if said first value is less than said second value, taking said second node as said label of said voice data and outputting it.
9. A speech recognition method, characterized in that the starting endpoint and the ending endpoint of a speech segment to be recognized are detected using the voice endpoint detection method of any one of claims 1 to 8.
CN201710076757.2A 2017-02-13 2017-02-13 Voice endpoint detection method and speech recognition method Pending CN108428448A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201710076757.2A CN108428448A (en) 2017-02-13 2017-02-13 Voice endpoint detection method and speech recognition method
PCT/CN2018/074311 WO2018145584A1 (en) 2017-02-13 2018-01-26 Voice activity detection method and voice recognition method
TW107104564A TWI659409B (en) 2017-02-13 2018-02-08 Speech point detection method and speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710076757.2A CN108428448A (en) 2017-02-13 2017-02-13 Voice endpoint detection method and speech recognition method

Publications (1)

Publication Number Publication Date
CN108428448A true CN108428448A (en) 2018-08-21

Family

ID=63107183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710076757.2A Pending CN108428448A (en) Voice endpoint detection method and speech recognition method

Country Status (3)

Country Link
CN (1) CN108428448A (en)
TW (1) TWI659409B (en)
WO (1) WO2018145584A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036459A (en) * 2018-08-22 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method, device, computer equipment, computer storage medium
CN109119070A (en) * 2018-10-19 2019-01-01 科大讯飞股份有限公司 A kind of sound end detecting method, device, equipment and storage medium
CN109378016A (en) * 2018-10-10 2019-02-22 四川长虹电器股份有限公司 A kind of keyword identification mask method based on VAD
CN110634483A (en) * 2019-09-03 2019-12-31 北京达佳互联信息技术有限公司 Man-machine interaction method and device, electronic equipment and storage medium
CN110827858A (en) * 2019-11-26 2020-02-21 苏州思必驰信息科技有限公司 Voice endpoint detection method and system
CN110875033A (en) * 2018-09-04 2020-03-10 蔚来汽车有限公司 Method, apparatus, and computer storage medium for determining a voice end point
CN110910905A (en) * 2018-09-18 2020-03-24 北京京东金融科技控股有限公司 Mute point detection method and device, storage medium and electronic equipment
CN111063356A (en) * 2018-10-17 2020-04-24 北京京东尚科信息技术有限公司 Electronic equipment response method and system, sound box and computer readable storage medium
CN111128174A (en) * 2019-12-31 2020-05-08 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN111583933A (en) * 2020-04-30 2020-08-25 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
WO2020192009A1 (en) * 2019-03-25 2020-10-01 平安科技(深圳)有限公司 Silence detection method based on neural network, and terminal device and medium
CN112151073A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice processing method, system, device and medium
CN112259089A (en) * 2019-07-04 2021-01-22 阿里巴巴集团控股有限公司 Voice recognition method and device
CN112652296A (en) * 2020-12-23 2021-04-13 北京华宇信息技术有限公司 Streaming voice endpoint detection method, device and equipment
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN115910043A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116469413A (en) * 2023-04-03 2023-07-21 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 Voice endpoint detection method and speech recognition method
US11227601B2 (en) * 2019-09-21 2022-01-18 Merry Electronics(Shenzhen) Co., Ltd. Computer-implement voice command authentication method and electronic device
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium
US20220103199A1 (en) * 2020-09-29 2022-03-31 Sonos, Inc. Audio Playback Management of Multiple Concurrent Connections
CN112365899A (en) * 2020-10-30 2021-02-12 北京小米松果电子有限公司 Voice processing method, device, storage medium and terminal equipment

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001086633A1 (en) * 2000-05-10 2001-11-15 Multimedia Technologies Institute - Mti S.R.L. Voice activity detection and end-point detection
WO2002061727A2 (en) * 2001-01-30 2002-08-08 Qualcomm Incorporated System and method for computing and transmitting parameters in a distributed voice recognition system
CN1953050A * 2005-10-19 2007-04-25 株式会社东芝 Device, method, and program for determining speech/non-speech
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency
CN102034475A (en) * 2010-12-08 2011-04-27 中国科学院自动化研究所 Method for interactively scoring open short conversation by using computer
CN103117060A (en) * 2013-01-18 2013-05-22 中国科学院声学研究所 Modeling approach and modeling system of acoustic model used in speech recognition
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105374350A (en) * 2015-09-29 2016-03-02 百度在线网络技术(北京)有限公司 Speech marking method and device
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
WO2018145584A1 (en) * 2017-02-13 2018-08-16 芋头科技(杭州)有限公司 Voice activity detection method and voice recognition method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
TWI299855B (en) * 2006-08-24 2008-08-11 Inventec Besta Co Ltd Detection method for voice activity endpoint
CN103730110B (en) * 2012-10-10 2017-03-01 北京百度网讯科技有限公司 A kind of method and apparatus of detection sound end
JP5753869B2 (en) * 2013-03-26 2015-07-22 富士ソフト株式会社 Speech recognition terminal and speech recognition method using computer terminal
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104409080B (en) * 2014-12-15 2018-09-18 北京国双科技有限公司 Sound end detecting method and device
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN105976810B (en) * 2016-04-28 2020-08-14 Tcl科技集团股份有限公司 Method and device for detecting end point of effective speech segment of voice

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001086633A1 (en) * 2000-05-10 2001-11-15 Multimedia Technologies Institute - Mti S.R.L. Voice activity detection and end-point detection
WO2002061727A2 (en) * 2001-01-30 2002-08-08 Qualcomm Incorporated System and method for computing and transmitting parameters in a distributed voice recognition system
CN1953050A * 2005-10-19 2007-04-25 株式会社东芝 Device, method, and program for determining speech/non-speech
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101740024A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for automatic evaluation based on generalized fluent spoken language fluency
CN102034475A (en) * 2010-12-08 2011-04-27 中国科学院自动化研究所 Method for interactively scoring open short conversation by using computer
CN103117060A (en) * 2013-01-18 2013-05-22 中国科学院声学研究所 Modeling approach and modeling system of acoustic model used in speech recognition
CN104681036A (en) * 2014-11-20 2015-06-03 苏州驰声信息科技有限公司 System and method for detecting language voice frequency
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105374350A (en) * 2015-09-29 2016-03-02 百度在线网络技术(北京)有限公司 Speech marking method and device
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
WO2018145584A1 (en) * 2017-02-13 2018-08-16 芋头科技(杭州)有限公司 Voice activity detection method and voice recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田旺兰 et al.: "Improved voice endpoint detection method using deep belief networks", Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109036459A (en) * 2018-08-22 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method, device, computer equipment, computer storage medium
CN110875033A (en) * 2018-09-04 2020-03-10 蔚来汽车有限公司 Method, apparatus, and computer storage medium for determining a voice end point
CN110910905B (en) * 2018-09-18 2023-05-02 京东科技控股股份有限公司 Mute point detection method and device, storage medium and electronic equipment
CN110910905A (en) * 2018-09-18 2020-03-24 北京京东金融科技控股有限公司 Mute point detection method and device, storage medium and electronic equipment
CN109378016A (en) * 2018-10-10 2019-02-22 四川长虹电器股份有限公司 A kind of keyword identification mask method based on VAD
CN111063356B (en) * 2018-10-17 2023-05-09 北京京东尚科信息技术有限公司 Electronic equipment response method and system, sound box and computer readable storage medium
CN111063356A (en) * 2018-10-17 2020-04-24 北京京东尚科信息技术有限公司 Electronic equipment response method and system, sound box and computer readable storage medium
CN109119070A (en) * 2018-10-19 2019-01-01 科大讯飞股份有限公司 A kind of sound end detecting method, device, equipment and storage medium
WO2020192009A1 (en) * 2019-03-25 2020-10-01 平安科技(深圳)有限公司 Silence detection method based on neural network, and terminal device and medium
CN112151073A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice processing method, system, device and medium
CN112259089A (en) * 2019-07-04 2021-01-22 阿里巴巴集团控股有限公司 Voice recognition method and device
CN110634483B (en) * 2019-09-03 2021-06-18 北京达佳互联信息技术有限公司 Man-machine interaction method and device, electronic equipment and storage medium
CN110634483A (en) * 2019-09-03 2019-12-31 北京达佳互联信息技术有限公司 Man-machine interaction method and device, electronic equipment and storage medium
US11620984B2 (en) 2019-09-03 2023-04-04 Beijing Dajia Internet Information Technology Co., Ltd. Human-computer interaction method, and electronic device and storage medium thereof
CN110827858A (en) * 2019-11-26 2020-02-21 苏州思必驰信息科技有限公司 Voice endpoint detection method and system
CN111128174A (en) * 2019-12-31 2020-05-08 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN111583933A (en) * 2020-04-30 2020-08-25 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN111583933B (en) * 2020-04-30 2023-10-27 北京猎户星空科技有限公司 Voice information processing method, device, equipment and medium
CN112652296A (en) * 2020-12-23 2021-04-13 北京华宇信息技术有限公司 Streaming voice endpoint detection method, device and equipment
CN112652296B (en) * 2020-12-23 2023-07-04 北京华宇信息技术有限公司 Method, device and equipment for detecting streaming voice endpoint
CN112967739A (en) * 2021-02-26 2021-06-15 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on long-term and short-term memory network
CN115910043A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN116469413A (en) * 2023-04-03 2023-07-21 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence
CN116469413B (en) * 2023-04-03 2023-12-01 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Also Published As

Publication number Publication date
TW201830377A (en) 2018-08-16
TWI659409B (en) 2019-05-11
WO2018145584A1 (en) 2018-08-16

Similar Documents

Publication Publication Date Title
CN108428448A (en) Voice endpoint detection method and speech recognition method
CN103578468B (en) The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition
CN105374356B (en) Audio recognition method, speech assessment method, speech recognition system and speech assessment system
CN107437415B (en) Intelligent voice interaction method and system
US20170140750A1 (en) Method and device for speech recognition
CN103165129B (en) Method and system for optimizing voice recognition acoustic model
KR20190045278A (en) A voice quality evaluation method and a voice quality evaluation apparatus
CN107886968B (en) Voice evaluation method and system
CN106782603B (en) Intelligent voice evaluation method and system
CN104252864A (en) Real-time speech analysis method and system
CN106611604A (en) An automatic voice summation tone detection method based on a deep neural network
CN101510423B (en) Multilevel interactive pronunciation quality estimation and diagnostic system
CN104318921A (en) Voice section segmentation detection method and system and spoken language detecting and evaluating method and system
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN106782508A (en) The cutting method of speech audio and the cutting device of speech audio
CN105225665A (en) A kind of audio recognition method and speech recognition equipment
CN104823235A (en) Speech recognition device
CN105895080A (en) Voice recognition model training method, speaker type recognition method and device
CN103680505A (en) Voice recognition method and voice recognition system
CN111883181A (en) Audio detection method and device, storage medium and electronic device
CN109215647A (en) Voice awakening method, electronic equipment and non-transient computer readable storage medium
CN105575402A (en) Network teaching real time voice analysis method
CN104103280A (en) Dynamic time warping algorithm based voice activity detection method and device
CN109243427A (en) A kind of car fault diagnosis method and device
CN109670148A (en) Collection householder method, device, equipment and storage medium based on speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1252735

Country of ref document: HK

RJ01 Rejection of invention patent application after publication

Application publication date: 20180821
