CN108172219A - method and device for recognizing voice - Google Patents
Method and device for recognizing voice
- Publication number
- CN108172219A (application CN201711133270.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- sound
- section
- identified
- sections
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention discloses a method and a device for recognizing voice. Wherein, the method comprises the following steps: determining a voice to be recognized; determining an effective sound segment from a plurality of sound segments contained in the voice to be recognized; and acquiring a recognition result of the voice to be recognized according to the command words extracted from the effective sound segment. The invention solves the technical problem of low voice recognition efficiency in the prior art.
Description
Technical field
The present invention relates to the field of intelligent control, and in particular to a method and an apparatus for recognizing voice.
Background technology
With the rapid development of science and technology, controlling smart devices by voice has become widely popular across many industries. In noisy environments, however, a smart device's acceptance rate for voice is relatively low, and its accuracy in recognizing voice is also relatively low. In addition, when a user's speech contains a command word for controlling a smart device, the device performs the corresponding operation according to that command word even when the user does not actually intend to control it. For example, if the user says "the programs on CCTV1 are very good", the television extracts the keyword "CCTV1" from the user's speech and switches the current program to a CCTV1 program. As can be seen from the above, existing methods of controlling smart devices by voice suffer from misrecognition.
In addition, to solve the above problem, the prior art mainly performs voiceprint-based category filtering on a mixed sound source (i.e., a sound source in which speech and non-speech are mixed together), and then selectively processes the filtered-out speech as a single sound source. A device executing this method must perform a large amount of computation, which degrades device performance and thus reduces the accuracy of recognizing voice.
For the above problem of low voice recognition efficiency in the prior art, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for recognizing voice, so as to at least solve the technical problem of low voice recognition efficiency in the prior art.
According to one aspect of the embodiments of the present invention, a method for recognizing voice is provided, including: determining a voice to be recognized; determining effective sound segments from a plurality of sound segments contained in the voice to be recognized; and obtaining a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
According to another aspect of the embodiments of the present invention, an apparatus for recognizing voice is further provided, including: a first determining module, configured to determine a voice to be recognized; a second determining module, configured to determine effective sound segments from a plurality of sound segments contained in the voice to be recognized; and an acquisition module, configured to obtain a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, and the program, when run, performs the method for recognizing voice.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, and the program, when run, performs the method for recognizing voice.
In the embodiments of the present invention, the voice is recognized by means of voiceprint recognition: a voice to be recognized is determined, effective sound segments are determined from the plurality of sound segments contained in the voice to be recognized, and the recognition result of the voice to be recognized is obtained according to the command word extracted from the effective sound segments. This achieves the purpose of accurately recognizing voice, thereby achieving the technical effect of improving the accuracy of voice recognition and in turn solving the technical problem of low voice recognition efficiency in the prior art.
Description of the drawings
The drawings described here are used to provide a further understanding of the present invention and form a part of the present application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a flowchart of a method for recognizing voice according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of optionally classifying mixed sound according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of optionally dividing independent sound segments according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of optionally determining effective sound segments according to an embodiment of the present invention; and
Fig. 5 is a schematic structural diagram of an apparatus for recognizing voice according to an embodiment of the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment for recognizing voice is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, the steps shown or described may, in some cases, be performed in an order different from the one shown here.
Fig. 1 is a flowchart of the method for recognizing voice according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102, determining a voice to be recognized.
It should be noted that in a noisy environment the sound contains both non-speech and speech, where non-speech is any sound other than a human voice, for example, the sound of a moving vehicle, while the voice to be recognized is human speech. In order to accurately recognize human speech, the speech needs to be extracted from the sound of the noisy environment; here, the voice to be recognized can be extracted from the sound of the noisy environment using a voiceprint recognition method.
The process of performing voiceprint recognition on sound mainly includes feature extraction and speech recognition. The task of the feature extraction stage is to extract and select acoustic or speech features of the speaker's voiceprint or speech that have strong separability and high stability. Good acoustic or speech features can not only effectively distinguish different speakers but also remain relatively stable and noise-robust when the speech of the same speaker changes. After the acoustic or speech features of the sound are extracted, speech recognition is performed according to these features. Speech recognition technology is widely used in the Internet and smart home industries; it mainly converts a speech signal into a corresponding control instruction through recognition and parsing, and mainly comprises feature extraction technology, pattern matching technology, and model training technology.
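The patent does not name a concrete feature type. As an illustrative sketch only, the following assumes MFCC features computed with the `librosa` library, a common choice for separable and relatively stable voiceprint features; the function name and parameters are hypothetical, not part of the disclosure.

```python
# Illustrative sketch only: MFCC-based voiceprint feature extraction.
# The patent does not specify a feature type; MFCCs are an assumption here.
import numpy as np
import librosa

def extract_voiceprint_features(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return one fixed-size feature vector summarizing an utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Averaging over time yields a compact, relatively stable summary,
    # in the spirit of the "high stability" requirement described above.
    return mfcc.mean(axis=1)
```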
It should also be noted that after speech and non-speech are separated from the sound of the noisy environment by voiceprint recognition technology, only the speech is processed in the subsequent speech processing, which can improve the accuracy of recognizing voice in a noisy environment.
Step S104, determining effective sound segments from the plurality of sound segments contained in the voice to be recognized.
It should be noted that the smart device performs the corresponding operation under the control of a command word contained in the effective sound segments; a command word contained in a non-effective sound segment cannot control the smart device. For example, if the speech content of a non-effective sound segment is "whenever I turn off the air conditioner, my dog gets very restless" and the speech content of an effective sound segment is "turn on the air conditioner right away", the processor of the smart device starts the air conditioner according to the command word "turn on the air conditioner" in the effective sound segment, but does not turn off the air conditioner according to the command word "turn off the air conditioner" in the non-effective sound segment.
It should also be noted that determining effective sound segments among the plurality of sound segments contained in the voice to be recognized, and using only the command words in the effective sound segments (rather than those in non-effective sound segments) to control the smart device, effectively prevents misrecognition of the voice.
Step S106, obtaining a recognition result of the voice to be recognized according to the command word extracted from the effective sound segments.
It should be noted that a command word is a word used to control the smart device, for example, "turn on the air conditioner" or "turn off the air conditioner".
Specifically, after the command word is extracted from the effective sound segments, the processor of the smart device further processes it, for example, by converting the command word into a corresponding control instruction and sending the control instruction to other units of the smart device, which then perform the operation corresponding to the control instruction.
From the scheme defined by the above steps S102 to S106, it can be seen that a voice to be recognized is determined, effective sound segments are determined from the plurality of sound segments contained in the voice to be recognized, and a recognition result of the voice to be recognized is obtained according to the command word extracted from the effective sound segments.
It is easy to note that, since the voice to be recognized contains multiple sound segments, command words may appear in several of them; extracting command words from every sound segment alone could therefore lead to misrecognition. By determining effective sound segments among the plurality of sound segments after the voice to be recognized is determined, and performing recognition only on the command words in the effective sound segments, misrecognition caused by command words embedded in longer speech can be reduced, further improving the accuracy of speech recognition.
From the above, this embodiment achieves the purpose of accurately recognizing voice, thereby achieving the technical effect of improving the accuracy of voice recognition and in turn solving the technical problem of low voice recognition efficiency in the prior art.
In an optional embodiment, determining the voice to be recognized includes:
Step S1020, obtaining mixed sound, where the mixed sound includes at least speech and non-speech;
Step S1022, extracting at least one voice from the mixed sound using a voiceprint recognition algorithm;
Step S1024, classifying the at least one voice based on the voiceprint recognition algorithm to obtain a classification result;
Step S1026, determining the voice to be recognized from the classification result.
Specifically, Fig. 2 shows a schematic diagram of optionally classifying mixed sound. In Fig. 2, the left side of the arrow is the human speech extracted from the mixed sound; this speech contains two sound sources, which may come from different users, where the black rectangles represent the sound segments of sound source 1 and the white rectangles represent the sound segments of sound source 2. Since the speech of different users has different speech features, the multiple voices can be distinguished according to each user's speech features, that is, the at least one voice is classified. The classification result is shown on the right side of the arrow in Fig. 2: sound source 1 consists of sound segments 1, 4, 5, and 6, and sound source 2 consists of sound segments 2, 3, and 7.
In addition, not every user can control the smart device by voice; only users with the required permission can. Therefore, after the at least one voice is distinguished, the speech features of the distinguished voices can be matched against speech features pre-stored in the smart device. If the match succeeds, the matched voice is taken as the voice to be recognized; if the match fails, the unmatched voice is discarded.
It should be noted that the at least one voice can be extracted from the mixed sound by means of machine learning, with the following specific steps:
Step S1022a, analyzing the mixed sound using a preset model to determine at least one voice in the mixed sound, where the preset model is obtained by training multiple groups of voice data through machine learning, and each group of data in the multiple groups of voice data includes: the mixed sound and a label marking the voice in the mixed sound;
Step S1022b, extracting the determined at least one voice from the mixed sound.
Specifically, the at least one voice in the mixed sound may come from different users, and the speech features of different users' voices differ. Therefore, after the mixed sound in the noisy environment is obtained, the mixed sound is analyzed using the preset model trained in advance through machine learning, and the speech features of the voice in the mixed sound are determined, for example, the number of different frequencies in the mixed sound, the amplitude type of the sound, and so on. In order to extract the voice from the sound, a neural network model can be established: the speech features of multiple groups of mixed sound are obtained in advance, corresponding labels are set for the speech features in each group of mixed sound by manual annotation, and the mixed speech is then trained with the set labels to obtain the preset model. After the preset model is built, the mixed sound can be used as the input of the preset model, and the output of the preset model is the at least one voice extracted from the mixed sound.
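As a minimal sketch of this training setup, and only under the assumption of a toy fully-connected network over fixed-size feature frames (the patent discloses neither an architecture, a loss, nor a feature layout), the preset model could be trained roughly as follows:

```python
# Illustrative sketch only: a tiny neural "preset model" trained on labeled
# mixed-sound features, in the spirit of steps S1022a-S1022b. Real systems
# would use spectrogram masks; the architecture and data here are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in training pairs: mixed-sound feature frames and speech-only labels.
mixed = torch.randn(256, 64)         # stand-in for mixed-sound features
speech_label = torch.randn(256, 64)  # stand-in for manually annotated speech

for _ in range(100):                 # toy training loop
    optimizer.zero_grad()
    loss = loss_fn(model(mixed), speech_label)
    loss.backward()
    optimizer.step()

# After training, the model maps mixed features to estimated speech features.
estimated_speech = model(mixed)
```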
It should be noted that the at least one voice is classified by building a classification model using a machine learning method, with the following specific steps:
Step S1024a, obtaining the at least one voice;
Step S1024b, analyzing the at least one voice using a classification model to determine the voice type of each of the at least one voice, where the classification model is obtained by training multiple groups of voice data through machine learning, and each group of voice data in the multiple groups of voice data includes: the at least one voice and a label of the voice type of the at least one voice;
Step S1024c, classifying the at least one voice according to the voice type of each of the at least one voice to obtain the classification result.
Specifically, after the at least one voice is obtained, it is analyzed using the classification model trained in advance through machine learning, and the voice type of each of the at least one voice is determined. The type to which a voice belongs can be determined from its speech features; for example, the speech features of a child's voice differ from those of an adult's voice, so the speech features can determine whether a voice belongs to an adult or a child. In order to determine the voice type from the speech features, a neural network model can be established: the speech features of multiple groups of voices are obtained in advance, corresponding labels are set for the voice types in each group of voices by manual annotation, and the voices are then trained with the set labels to obtain the classification model. After the classification model is built, the multiple voices can be used as the input of the classification model, and the output of the classification model is the voice type corresponding to each voice.
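A minimal sketch of such a classification model, assuming pre-computed voiceprint feature vectors and synthetic labels (the patent specifies neither the classifier type nor the feature layout; an SVM stands in for the neural network here purely for brevity):

```python
# Illustrative sketch of the classification model in steps S1024a-S1024c:
# a classifier trained on labeled voiceprint feature vectors that predicts
# a voice type (e.g. "adult" vs "child"). Features and labels are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 20))       # stand-in voiceprint vectors
labels = np.array(["adult"] * 20 + ["child"] * 20)

classifier = SVC().fit(features, labels)
print(classifier.predict(features[:2]))    # predicted voice types
```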
It should be noted that after the voice to be recognized is determined, effective sound segments still need to be extracted from the plurality of sound segments contained in the voice to be recognized, and the smart device is controlled by the command words in the effective sound segments. Determining effective sound segments from the sound segments contained in the voice to be recognized specifically includes the following steps:
Step S1040, determining the plurality of sound segments in the voice to be recognized;
Step S1042, dividing the plurality of sound segments into at least one independent sound segment according to duration parameters of the plurality of sound segments contained in the voice to be recognized;
Step S1044, determining the effective sound segments according to the duration of the at least one independent sound segment.
In an optional embodiment, a user pauses between sentences while speaking, so the voice to be recognized can be divided into multiple sound segments according to the pause times. For example, suppose the user's speech lasts 10 seconds: the user does not pause during the first 2 seconds, the 3rd second is a pause, seconds 4 to 7 are speech, the 8th second is a pause, and seconds 9 to 10 are speech. The voice to be recognized can thus be divided into three sound segments, i.e., the sound segments corresponding to seconds 1-2, 4-7, and 9-10.
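A minimal sketch of this pause-based division, assuming a per-second voice-activity flag is already available (how activity is detected is not specified here, and the function name is illustrative):

```python
# Sketch: split a speech signal into sound segments at pauses, assuming a
# per-second voice-activity flag (True = speech) is already available.

def split_on_pauses(active):
    """Return (start, end) second indices, 1-based inclusive, of sound segments."""
    segments, start = [], None
    for i, is_speech in enumerate(active, start=1):
        if is_speech and start is None:
            start = i                        # a sound segment begins
        elif not is_speech and start is not None:
            segments.append((start, i - 1))  # a pause ends the segment
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

# The 10-second example from the description: pauses in seconds 3 and 8.
activity = [True, True, False, True, True, True, True, False, True, True]
print(split_on_pauses(activity))  # [(1, 2), (4, 7), (9, 10)]
```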
In an optional embodiment, after the plurality of sound segments in the voice to be recognized are determined, the plurality of sound segments are divided into at least one independent sound segment according to the duration parameters of the sound segments, where the duration parameters include at least one of the following: the start time of a sound segment and the end time of a sound segment. The specific method of dividing independent sound segments is as follows:
Step S1042a, determining the end time of a first sound segment and the start time of a second sound segment, where the first sound segment is the adjacent sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment;
Step S1042b, calculating a first time difference between the end time of the first sound segment and the start time of the second sound segment;
Step S1042c, determining that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is less than a first preset threshold;
Step S1042d, determining that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to the first preset threshold.
This is now illustrated with Fig. 3, which shows a schematic diagram of optionally dividing independent sound segments. Fig. 3 contains 6 sound segments, namely sound segments 1 to 6, where the start and end times of sound segment 1 are t1 and t2 respectively, those of sound segment 2 are t3 and t4, those of sound segment 3 are t5 and t6, those of sound segment 4 are t7 and t8, those of sound segment 5 are t9 and t10, and those of sound segment 6 are t11 and t12. Suppose the first preset threshold is δ. Since t3-t2 < δ and t5-t4 < δ, sound segments 1, 2, and 3 are grouped into independent sound segment 1; since t7-t6 > δ and t9-t8 > δ, sound segment 4 alone forms an independent sound segment, i.e., independent sound segment 2; and since t9-t8 > δ and t11-t10 < δ, sound segments 5 and 6 are grouped into independent sound segment 3.
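The grouping rule of steps S1042a to S1042d can be sketched as follows; the concrete timestamps and δ = 1.0 are assumptions chosen only so that the inequalities of the Fig. 3 example hold:

```python
# Sketch: group adjacent sound segments into one independent sound segment
# when the gap between one segment's end and the next segment's start is
# below the first preset threshold delta (steps S1042a-S1042d).

def group_segments(segments, delta):
    """segments: list of (start, end) tuples sorted by time."""
    groups = [[segments[0]]]
    for prev, cur in zip(segments, segments[1:]):
        if cur[0] - prev[1] < delta:   # first time difference < delta
            groups[-1].append(cur)     # same independent sound segment
        else:
            groups.append([cur])       # new independent sound segment
    return groups

# Values assumed so as to reproduce the Fig. 3 example (delta = 1.0 assumed):
segs = [(0.0, 1.0), (1.5, 2.5), (3.0, 4.0),  # segments 1-3: gaps < delta
        (6.0, 7.0),                          # segment 4: isolated
        (9.0, 10.0), (10.5, 11.5)]           # segments 5-6: gap < delta
print(group_segments(segs, delta=1.0))
# [[(0.0, 1.0), (1.5, 2.5), (3.0, 4.0)], [(6.0, 7.0)], [(9.0, 10.0), (10.5, 11.5)]]
```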
After the independent sound segments are determined, the effective sound segments are determined from the divided independent sound segments. Determining the effective sound segments according to the duration of the at least one independent sound segment specifically includes the following steps:
Step S1044a, determining the duration of each of the at least one independent sound segment;
Step S1044b, determining that an independent sound segment is an effective sound segment when the duration of the independent sound segment is less than a second preset threshold.
This is now illustrated with Fig. 4, which shows a schematic diagram of optionally determining effective sound segments. Fig. 4 contains 3 independent sound segments, namely independent sound segments 1, 2, and 3, where the start and end times of independent sound segment 1 are T1 and T2, so its duration is λ1 = T2 - T1; the start and end times of independent sound segment 2 are T3 and T4, so its duration is λ2 = T4 - T3; and the start and end times of independent sound segment 3 are T5 and T6, so its duration is λ3 = T6 - T5. Suppose the second preset threshold is λ. Since λ1 > λ and λ3 > λ while λ2 < λ, independent sound segment 2 is taken as an effective sound segment, and independent sound segments 1 and 3 are non-effective sound segments.
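A corresponding sketch of steps S1044a to S1044b, with durations and λ = 3.0 assumed so as to reproduce the Fig. 4 outcome (only independent sound segment 2 is shorter than λ):

```python
# Sketch: an independent sound segment is "effective" when its duration is
# below the second preset threshold (steps S1044a-S1044b); short utterances
# are treated as commands, long ones as ordinary conversation.

def effective_segments(independent, lam):
    """independent: list of (start, end) independent sound segments."""
    return [(s, e) for (s, e) in independent if (e - s) < lam]

# Values assumed so as to reproduce the Fig. 4 example (lam = 3.0 assumed):
independent = [(0.0, 4.0), (6.0, 8.0), (9.0, 14.0)]
print(effective_segments(independent, lam=3.0))  # [(6.0, 8.0)]
```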
It should be noted that after the effective sound segments in the voice to be recognized are determined, the command word contained in the effective sound segments is extracted, and the smart device is controlled to perform the corresponding operation according to the command word. Obtaining the recognition result of the voice to be recognized according to the command word extracted from the effective sound segments specifically includes the following steps:
Step S1060, obtaining the command word in the effective sound segments;
Step S1062, determining, in a preset speech library, the control instruction corresponding to the command word;
Step S1064, controlling the device to perform the operation corresponding to the control instruction.
Specifically, after the effective sound segments are obtained, the processor of the smart device analyzes them using an extraction model and determines the command word in the effective sound segments, where the extraction model is obtained by training multiple groups of effective sound segments through machine learning, and each group of effective sound segments includes: the effective sound segments and a label marking the command word in the effective sound segments. After the command word is extracted from the effective sound segments, it is matched against the keywords in the preset speech library. If a keyword matching the command word exists in the preset speech library, the control instruction is determined according to the matched keyword, and the smart device is controlled to perform the corresponding operation according to the control instruction; if no keyword matching the command word is found in the preset speech library, the smart device performs no operation.
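A minimal sketch of steps S1060 to S1064; the library contents, instruction codes, and function names are all illustrative assumptions rather than the patent's actual vocabulary:

```python
# Sketch of steps S1060-S1064: match an extracted command word against a
# preset keyword library and issue the mapped control instruction.

PRESET_LIBRARY = {
    "turn on the air conditioner": "AC_POWER_ON",
    "turn off the air conditioner": "AC_POWER_OFF",
}

def handle_command_word(word, send_instruction):
    instruction = PRESET_LIBRARY.get(word)
    if instruction is None:
        return                 # no matching keyword: perform no operation
    send_instruction(instruction)

handle_command_word("turn on the air conditioner", print)  # -> AC_POWER_ON
handle_command_word("close the window", print)             # no operation
```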
Embodiment 2
According to an embodiment of the present invention, an apparatus embodiment for recognizing voice is further provided. Fig. 5 is a schematic structural diagram of the apparatus for recognizing voice according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes: a first determining module 501, a second determining module 503, and an acquisition module 505.
The first determining module 501 is configured to determine a voice to be recognized; the second determining module 503 is configured to determine effective sound segments from the plurality of sound segments contained in the voice to be recognized; and the acquisition module 505 is configured to obtain a recognition result of the voice to be recognized according to the command word extracted from the effective sound segments.
It should be noted that the first determining module 501, the second determining module 503, and the acquisition module 505 correspond to steps S102 to S106 in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the first determining module includes: a first acquisition module, an extraction module, a first processing module, and a third determining module. The first acquisition module is configured to obtain mixed sound, where the mixed sound includes at least speech and non-speech; the extraction module is configured to extract at least one voice from the mixed sound using a voiceprint recognition algorithm; the first processing module is configured to classify the at least one voice based on the voiceprint recognition algorithm to obtain a classification result; and the third determining module is configured to determine the voice to be recognized from the classification result.
It should be noted that the first acquisition module, the extraction module, the first processing module, and the third determining module correspond to steps S1020 to S1026 in Embodiment 1; the four modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the extraction module includes: a fourth determining module and a fifth determining module. The fourth determining module is configured to analyze the mixed sound using a preset model to determine at least one voice in the mixed sound, where the preset model is obtained by training multiple groups of voice data through machine learning, and each group of data in the multiple groups of voice data includes: the mixed sound and a label marking the voice in the mixed sound; the fifth determining module is configured to extract the determined at least one voice from the mixed sound.
It should be noted that the fourth determining module and the fifth determining module correspond to steps S1022a to S1022b in Embodiment 1; the two modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the processing module includes: a second acquisition module, a sixth determining module, and a second processing module. The second acquisition module is configured to obtain the at least one voice; the sixth determining module is configured to analyze the at least one voice using a classification model to determine the voice type of each of the at least one voice, where the classification model is obtained by training multiple groups of voice data through machine learning, and each group of voice data in the multiple groups of voice data includes: the at least one voice and a label of the voice type of the at least one voice; the second processing module is configured to classify the at least one voice according to the voice type of each of the at least one voice to obtain the classification result.
It should be noted that the second acquisition module, the sixth determining module, and the second processing module correspond to steps S1024a to S1024c in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the second determining module includes: a seventh determining module, a division module, and an eighth determining module. The seventh determining module is configured to determine the plurality of sound segments in the voice to be recognized; the division module is configured to divide the plurality of sound segments into at least one independent sound segment according to the duration parameters of the plurality of sound segments contained in the voice to be recognized; the eighth determining module is configured to determine the effective sound segments according to the duration of the at least one independent sound segment.
It should be noted that the seventh determining module, the division module, and the eighth determining module correspond to steps S1040 to S1044 in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the duration parameters include at least one of the following: the start time of a sound segment and the end time of a sound segment, and the division module includes: a ninth determining module, a calculation module, a tenth determining module, and an eleventh determining module. The ninth determining module is configured to determine the end time of a first sound segment and the start time of a second sound segment, where the first sound segment is the adjacent sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment; the calculation module is configured to calculate a first time difference between the end time of the first sound segment and the start time of the second sound segment; the tenth determining module is configured to determine that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is less than a first preset threshold; the eleventh determining module is configured to determine that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to the first preset threshold.
It should be noted that the ninth determining module, the calculation module, the tenth determining module, and the eleventh determining module correspond to steps S1042a to S1042d in Embodiment 1; the four modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the eighth determining module includes: a twelfth determining module and a thirteenth determining module. The twelfth determining module is configured to determine the duration of each of the at least one independent sound segment; the thirteenth determining module is configured to determine that an independent sound segment is an effective sound segment when the duration of the independent sound segment is less than a second preset threshold.
It should be noted that the twelfth determining module and the thirteenth determining module correspond to steps S1044a to S1044b in Embodiment 1; the two modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the acquisition module includes: a third acquisition module, a fourteenth determining module, and a control module. The third acquisition module is configured to obtain the command word in the effective sound segments; the fourteenth determining module is configured to determine, in the preset speech library, the control instruction corresponding to the command word; the control module is configured to control the device to perform the operation corresponding to the control instruction.
It should be noted that the third acquisition module, the fourteenth determining module, and the control module correspond to steps S1060 to S1064 in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
Embodiment 3
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, and the program performs the method for recognizing voice in Embodiment 1 above.
Embodiment 4
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, and the program, when run, performs the method for recognizing voice in Embodiment 1 above.
The above serial numbers of the embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units may be a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.
Claims (11)
- 1. A method for recognizing voice, characterized by comprising: determining a voice to be recognized; determining effective sound segments from a plurality of sound segments contained in the voice to be recognized; and obtaining a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
- 2. The method according to claim 1, characterized in that determining the voice to be recognized comprises: obtaining mixed sound, wherein the mixed sound includes at least speech and non-speech; extracting at least one voice from the mixed sound using a voiceprint recognition algorithm; classifying the at least one voice based on the voiceprint recognition algorithm to obtain a classification result; and determining the voice to be recognized from the classification result.
- 3. The method according to claim 2, characterized in that extracting at least one voice from the mixed sound using the voiceprint recognition algorithm comprises: analyzing the mixed sound using a preset model to determine at least one voice in the mixed sound, wherein the preset model is obtained by training multiple groups of voice data through machine learning, and each group of data in the multiple groups of voice data includes: the mixed sound and a label marking the voice in the mixed sound; and extracting the determined at least one voice from the mixed sound.
- 4. The method according to claim 2, characterized in that classifying the at least one voice based on the voiceprint recognition algorithm to obtain the classification result comprises: obtaining the at least one voice; analyzing the at least one voice using a classification model to determine the voice type of each of the at least one voice, wherein the classification model is obtained by training multiple groups of voice data through machine learning, and each group of voice data in the multiple groups of voice data includes: the at least one voice and a label of the voice type of the at least one voice; and classifying the at least one voice according to the voice type of each of the at least one voice to obtain the classification result.
- 5. The method according to claim 1, characterized in that determining effective sound segments from the sound segments contained in the voice to be recognized comprises: determining a plurality of sound segments in the voice to be recognized; dividing the plurality of sound segments into at least one independent sound segment according to duration parameters of the plurality of sound segments contained in the voice to be recognized; and determining the effective sound segments according to the duration of the at least one independent sound segment.
- 6. The method according to claim 5, characterized in that the duration parameters include at least one of the following: a start time of a sound segment and an end time of a sound segment, wherein dividing the plurality of sound segments into at least one independent sound segment according to the duration parameters of the plurality of sound segments contained in the voice to be recognized comprises: determining the end time of a first sound segment and the start time of a second sound segment, wherein the first sound segment is the adjacent sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment; calculating a first time difference between the end time of the first sound segment and the start time of the second sound segment; determining that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is less than a first preset threshold; and determining that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to the first preset threshold.
- 7. The method according to claim 5, characterized in that determining the effective sound segments according to the duration of the at least one independent sound segment comprises: determining the duration of each of the at least one independent sound segment; and determining that an independent sound segment is an effective sound segment when the duration of the independent sound segment is less than a second preset threshold.
- 8. The method according to claim 1, characterized in that obtaining the recognition result of the voice to be recognized according to the command word extracted from the effective sound segments comprises: obtaining the command word in the effective sound segments; determining, in a preset speech library, the control instruction corresponding to the command word; and controlling a device to perform the operation corresponding to the control instruction.
- 9. An apparatus for recognizing voice, characterized by comprising: a first determining module, configured to determine a voice to be recognized; a second determining module, configured to determine effective sound segments from a plurality of sound segments contained in the voice to be recognized; and an acquisition module, configured to obtain a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
- 10. A storage medium, characterized in that the storage medium includes a stored program, wherein the program performs the method for recognizing voice according to any one of claims 1 to 8.
- 11. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the method for recognizing voice according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711133270.XA CN108172219B (en) | 2017-11-14 | 2017-11-14 | Method and device for recognizing voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711133270.XA CN108172219B (en) | 2017-11-14 | 2017-11-14 | Method and device for recognizing voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108172219A true CN108172219A (en) | 2018-06-15 |
CN108172219B CN108172219B (en) | 2021-02-26 |
Family
ID=62527360
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711133270.XA Active CN108172219B (en) | 2017-11-14 | 2017-11-14 | Method and device for recognizing voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108172219B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145124A (en) * | 2018-08-16 | 2019-01-04 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
CN110310657A (en) * | 2019-07-10 | 2019-10-08 | 北京猎户星空科技有限公司 | A kind of audio data processing method and device |
WO2020042603A1 (en) * | 2018-08-27 | 2020-03-05 | 广东美的制冷设备有限公司 | Voice control method, module, household appliance, system and computer storage medium |
RU2735363C1 (en) * | 2019-08-16 | 2020-10-30 | Бейджин Сяоми Мобайл Софтвеа Ко., Лтд. | Method and device for sound processing and data medium |
CN111951786A (en) * | 2019-05-16 | 2020-11-17 | 武汉Tcl集团工业研究院有限公司 | Training method and device of voice recognition model, terminal equipment and medium |
WO2021031811A1 (en) * | 2019-08-21 | 2021-02-25 | 华为技术有限公司 | Method and device for voice enhancement |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1017041B1 (en) * | 1995-11-15 | 2005-05-11 | Hitachi, Ltd. | Voice recognizing and translating system |
CN102655002A (en) * | 2011-03-01 | 2012-09-05 | 株式会社理光 | Audio processing method and audio processing equipment |
US8412455B2 (en) * | 2010-08-13 | 2013-04-02 | Ambit Microsystems (Shanghai) Ltd. | Voice-controlled navigation device and method |
CN103489454A (en) * | 2013-09-22 | 2014-01-01 | 浙江大学 | Voice endpoint detection method based on waveform morphological characteristic clustering |
CN104021785A (en) * | 2014-05-28 | 2014-09-03 | 华南理工大学 | Method of extracting speech of most important guest in meeting |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
US20140350918A1 (en) * | 2013-05-24 | 2014-11-27 | Tencent Technology (Shenzhen) Co., Ltd. | Method and system for adding punctuation to voice files |
CN104464723A (en) * | 2014-12-16 | 2015-03-25 | 科大讯飞股份有限公司 | Voice interaction method and system |
CN104536978A (en) * | 2014-12-05 | 2015-04-22 | 奇瑞汽车股份有限公司 | Voice data identifying method and device |
CN105280183A (en) * | 2015-09-10 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Voice interaction method and system |
US9324319B2 (en) * | 2013-05-21 | 2016-04-26 | Speech Morphing Systems, Inc. | Method and apparatus for exemplary segment classification |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN106601233A (en) * | 2016-12-22 | 2017-04-26 | 北京元心科技有限公司 | Voice command recognition method and device and electronic equipment |
CN106601241A (en) * | 2016-12-26 | 2017-04-26 | 河南思维信息技术有限公司 | Automatic time correcting method for recording file |
CN106683680A (en) * | 2017-03-10 | 2017-05-17 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device and computer equipment and computer readable media |
CN106782506A (en) * | 2016-11-23 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method that recorded audio is divided into section |
-
2017
- 2017-11-14 CN CN201711133270.XA patent/CN108172219B/en active Active
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1017041B1 (en) * | 1995-11-15 | 2005-05-11 | Hitachi, Ltd. | Voice recognizing and translating system |
US8412455B2 (en) * | 2010-08-13 | 2013-04-02 | Ambit Microsystems (Shanghai) Ltd. | Voice-controlled navigation device and method |
CN102655002A (en) * | 2011-03-01 | 2012-09-05 | 株式会社理光 | Audio processing method and audio processing equipment |
US9324319B2 (en) * | 2013-05-21 | 2016-04-26 | Speech Morphing Systems, Inc. | Method and apparatus for exemplary segment classification |
US20140350918A1 (en) * | 2013-05-24 | 2014-11-27 | Tencent Technology (Shenzhen) Co., Ltd. | Method and system for adding punctuation to voice files |
CN103489454A (en) * | 2013-09-22 | 2014-01-01 | 浙江大学 | Voice endpoint detection method based on waveform morphological characteristic clustering |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN104021785A (en) * | 2014-05-28 | 2014-09-03 | 华南理工大学 | Method of extracting speech of most important guest in meeting |
CN104536978A (en) * | 2014-12-05 | 2015-04-22 | 奇瑞汽车股份有限公司 | Voice data identifying method and device |
CN104464723A (en) * | 2014-12-16 | 2015-03-25 | 科大讯飞股份有限公司 | Voice interaction method and system |
CN105280183A (en) * | 2015-09-10 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Voice interaction method and system |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN106782506A (en) * | 2016-11-23 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method that recorded audio is divided into section |
CN106601233A (en) * | 2016-12-22 | 2017-04-26 | 北京元心科技有限公司 | Voice command recognition method and device and electronic equipment |
CN106601241A (en) * | 2016-12-26 | 2017-04-26 | 河南思维信息技术有限公司 | Automatic time correcting method for recording file |
CN106683680A (en) * | 2017-03-10 | 2017-05-17 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device and computer equipment and computer readable media |
Non-Patent Citations (1)
Title |
---|
Lü Xibao: "Research on Voice Control Technology for Mobile Robots", China Master's Theses Full-text Database *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145124A (en) * | 2018-08-16 | 2019-01-04 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
CN109145124B (en) * | 2018-08-16 | 2022-02-25 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
WO2020042603A1 (en) * | 2018-08-27 | 2020-03-05 | 广东美的制冷设备有限公司 | Voice control method, module, household appliance, system and computer storage medium |
CN111951786A (en) * | 2019-05-16 | 2020-11-17 | 武汉Tcl集团工业研究院有限公司 | Training method and device of voice recognition model, terminal equipment and medium |
CN110310657A (en) * | 2019-07-10 | 2019-10-08 | 北京猎户星空科技有限公司 | A kind of audio data processing method and device |
CN110310657B (en) * | 2019-07-10 | 2022-02-08 | 北京猎户星空科技有限公司 | Audio data processing method and device |
RU2735363C1 (en) * | 2019-08-16 | 2020-10-30 | Бейджин Сяоми Мобайл Софтвеа Ко., Лтд. | Method and device for sound processing and data medium |
US11264027B2 (en) | 2019-08-16 | 2022-03-01 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and apparatus for determining target audio data during application waking-up |
WO2021031811A1 (en) * | 2019-08-21 | 2021-02-25 | 华为技术有限公司 | Method and device for voice enhancement |
CN112420063A (en) * | 2019-08-21 | 2021-02-26 | 华为技术有限公司 | Voice enhancement method and device |
CN112420063B (en) * | 2019-08-21 | 2024-10-18 | 华为技术有限公司 | Voice enhancement method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108172219B (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108172219A (en) | method and device for recognizing voice | |
WO2021012734A1 (en) | Audio separation method and apparatus, electronic device and computer-readable storage medium | |
CN110021308B (en) | Speech emotion recognition method and device, computer equipment and storage medium | |
CN109523986B (en) | Speech synthesis method, apparatus, device and storage medium | |
CN103700370B (en) | A kind of radio and television speech recognition system method and system | |
CN110334201A (en) | A kind of intension recognizing method, apparatus and system | |
CN110136749A (en) | The relevant end-to-end speech end-point detecting method of speaker and device | |
DE112021001064T5 (en) | Device-directed utterance recognition | |
CN108305643A (en) | The determination method and apparatus of emotion information | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
CN110379441B (en) | Voice service method and system based on countermeasure type artificial intelligence network | |
CN111524527A (en) | Speaker separation method, device, electronic equipment and storage medium | |
CN110148399A (en) | A kind of control method of smart machine, device, equipment and medium | |
CN111901627B (en) | Video processing method and device, storage medium and electronic equipment | |
CN113223560A (en) | Emotion recognition method, device, equipment and storage medium | |
CN109376363A (en) | A kind of real-time voice interpretation method and device based on earphone | |
CN109065051A (en) | Voice recognition processing method and device | |
CN106531195B (en) | A kind of dialogue collision detection method and device | |
CN107274900A (en) | Information processing method and its system for control terminal | |
Alghifari et al. | On the use of voice activity detection in speech emotion recognition | |
CN109830240A (en) | Method, apparatus and system based on voice operating instruction identification user's specific identity | |
CN106205610B (en) | A kind of voice information identification method and equipment | |
CN111128240B (en) | Voice emotion recognition method based on anti-semantic-erasure | |
CN112667787A (en) | Intelligent response method, system and storage medium based on phonetics label | |
CN111462762A (en) | Speaker vector regularization method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||