CN108172219A - method and device for recognizing voice - Google Patents
Method and device for recognizing voice
- Publication number
- CN108172219A (application CN201711133270.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- sound
- section
- identified
- sections
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention discloses a method and a device for recognizing voice. Wherein, the method comprises the following steps: determining a voice to be recognized; determining an effective sound segment from a plurality of sound segments contained in the voice to be recognized; and acquiring a recognition result of the voice to be recognized according to the command words extracted from the effective sound segment. The invention solves the technical problem of low voice recognition efficiency in the prior art.
Description
Technical field
The present invention relates to the field of intelligent control, and in particular to a method and an apparatus for recognizing voice.
Background technology
With the rapid development of science and technology, controlling smart devices by voice has become widely popular across many industries. In noisy environments, however, a smart device's acceptance rate for voice is relatively low, and its accuracy in recognizing voice is also relatively low. In addition, when a user's speech contains a command word for controlling a smart device, the device performs the corresponding operation according to that command word even when the user does not actually intend to control it. For example, if the user says "the programs on CCTV1 are very good", the television extracts the keyword "CCTV1" from the user's speech and switches the current program to a CCTV1 program. As can be seen from the above, existing methods of controlling smart devices by voice suffer from misrecognition.
In addition, to solve the above problem, the prior art mainly performs voiceprint-based category filtering on a mixed sound source (i.e., a sound source in which speech and non-speech are mixed together), and then selectively processes the filtered-out speech as a single sound source. A device executing this method must perform a large amount of computation, which degrades device performance and thus reduces the accuracy of recognizing voice.
For the above problem of low voice recognition efficiency in the prior art, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for recognizing voice, so as to at least solve the technical problem of low voice recognition efficiency in the prior art.
According to one aspect of the embodiments of the present invention, a method for recognizing voice is provided, including: determining a voice to be recognized; determining effective sound segments from a plurality of sound segments contained in the voice to be recognized; and obtaining a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
According to another aspect of the embodiments of the present invention, an apparatus for recognizing voice is further provided, including: a first determining module, configured to determine a voice to be recognized; a second determining module, configured to determine effective sound segments from a plurality of sound segments contained in the voice to be recognized; and an acquisition module, configured to obtain a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, and the program, when run, performs the method for recognizing voice.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, and the program, when run, performs the method for recognizing voice.
In the embodiments of the present invention, the voice is recognized by means of voiceprint recognition: a voice to be recognized is determined, effective sound segments are determined from the plurality of sound segments contained in the voice to be recognized, and the recognition result of the voice to be recognized is obtained according to the command word extracted from the effective sound segments. This achieves the purpose of accurately recognizing voice, thereby achieving the technical effect of improving the accuracy of voice recognition and in turn solving the technical problem of low voice recognition efficiency in the prior art.
Description of the drawings
The drawings described here are used to provide a further understanding of the present invention and form a part of the present application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a flowchart of a method for recognizing voice according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of optionally classifying mixed sound according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of optionally dividing independent sound segments according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of optionally determining effective sound segments according to an embodiment of the present invention; and
Fig. 5 is a schematic structural diagram of an apparatus for recognizing voice according to an embodiment of the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment for recognizing voice is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, the steps shown or described may, in some cases, be performed in an order different from the one shown here.
Fig. 1 is a flowchart of the method for recognizing voice according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102, determining a voice to be recognized.
It should be noted that in a noisy environment the sound contains both non-speech and speech, where non-speech is any sound other than a human voice, for example, the sound of a moving vehicle, while the voice to be recognized is human speech. In order to accurately recognize human speech, the speech needs to be extracted from the sound of the noisy environment; here, the voice to be recognized can be extracted from the sound of the noisy environment using a voiceprint recognition method.
The process of performing voiceprint recognition on sound mainly includes feature extraction and speech recognition. The task of the feature extraction stage is to extract and select acoustic or speech features of the speaker's voiceprint or speech that have strong separability and high stability. Good acoustic or speech features can not only effectively distinguish different speakers but also remain relatively stable and noise-robust when the speech of the same speaker changes. After the acoustic or speech features of the sound are extracted, speech recognition is performed according to these features. Speech recognition technology is widely used in the Internet and smart home industries; it mainly converts a speech signal into a corresponding control instruction through recognition and parsing, and mainly comprises feature extraction technology, pattern matching technology, and model training technology.
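The patent does not name a concrete feature type. As an illustrative sketch only, the following assumes MFCC features computed with the `librosa` library, a common choice for separable and relatively stable voiceprint features; the function name and parameters are hypothetical, not part of the disclosure.

```python
# Illustrative sketch only: MFCC-based voiceprint feature extraction.
# The patent does not specify a feature type; MFCCs are an assumption here.
import numpy as np
import librosa

def extract_voiceprint_features(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return one fixed-size feature vector summarizing an utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Averaging over time yields a compact, relatively stable summary,
    # in the spirit of the "high stability" requirement described above.
    return mfcc.mean(axis=1)
```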
It should also be noted that after speech and non-speech are separated from the sound of the noisy environment by voiceprint recognition technology, only the speech is processed in the subsequent speech processing, which can improve the accuracy of recognizing voice in a noisy environment.
Step S104, determining effective sound segments from the plurality of sound segments contained in the voice to be recognized.
It should be noted that the smart device performs the corresponding operation under the control of a command word contained in the effective sound segments; a command word contained in a non-effective sound segment cannot control the smart device. For example, if the speech content of a non-effective sound segment is "whenever I turn off the air conditioner, my dog gets very restless" and the speech content of an effective sound segment is "turn on the air conditioner right away", the processor of the smart device starts the air conditioner according to the command word "turn on the air conditioner" in the effective sound segment, but does not turn off the air conditioner according to the command word "turn off the air conditioner" in the non-effective sound segment.
It should also be noted that determining effective sound segments among the plurality of sound segments contained in the voice to be recognized, and using only the command words in the effective sound segments (rather than those in non-effective sound segments) to control the smart device, effectively prevents misrecognition of the voice.
Step S106, obtaining a recognition result of the voice to be recognized according to the command word extracted from the effective sound segments.
It should be noted that a command word is a word used to control the smart device, for example, "turn on the air conditioner" or "turn off the air conditioner".
Specifically, after the command word is extracted from the effective sound segments, the processor of the smart device further processes it, for example, by converting the command word into a corresponding control instruction and sending the control instruction to other units of the smart device, which then perform the operation corresponding to the control instruction.
From the scheme defined by the above steps S102 to S106, it can be seen that a voice to be recognized is determined, effective sound segments are determined from the plurality of sound segments contained in the voice to be recognized, and a recognition result of the voice to be recognized is obtained according to the command word extracted from the effective sound segments.
It is easy to note that, since the voice to be recognized contains multiple sound segments, command words may appear in several of them; extracting command words from every sound segment alone could therefore lead to misrecognition. By determining effective sound segments among the plurality of sound segments after the voice to be recognized is determined, and performing recognition only on the command words in the effective sound segments, misrecognition caused by command words embedded in longer speech can be reduced, further improving the accuracy of speech recognition.
From the above, this embodiment achieves the purpose of accurately recognizing voice, thereby achieving the technical effect of improving the accuracy of voice recognition and in turn solving the technical problem of low voice recognition efficiency in the prior art.
In an optional embodiment, determining the voice to be recognized includes:
Step S1020, obtaining mixed sound, where the mixed sound includes at least speech and non-speech;
Step S1022, extracting at least one voice from the mixed sound using a voiceprint recognition algorithm;
Step S1024, classifying the at least one voice based on the voiceprint recognition algorithm to obtain a classification result;
Step S1026, determining the voice to be recognized from the classification result.
Specifically, Fig. 2 shows a schematic diagram of optionally classifying mixed sound. In Fig. 2, the left side of the arrow is the human speech extracted from the mixed sound; this speech contains two sound sources, which may come from different users, where the black rectangles represent the sound segments of sound source 1 and the white rectangles represent the sound segments of sound source 2. Since the speech of different users has different speech features, the multiple voices can be distinguished according to each user's speech features, that is, the at least one voice is classified. The classification result is shown on the right side of the arrow in Fig. 2: sound source 1 consists of sound segments 1, 4, 5, and 6, and sound source 2 consists of sound segments 2, 3, and 7.
In addition, not every user can control the smart device by voice; only users with the required permission can. Therefore, after the at least one voice is distinguished, the speech features of the distinguished voices can be matched against speech features pre-stored in the smart device. If the match succeeds, the matched voice is taken as the voice to be recognized; if the match fails, the unmatched voice is discarded.
It should be noted that the at least one voice can be extracted from the mixed sound by means of machine learning, with the following specific steps:
Step S1022a, analyzing the mixed sound using a preset model to determine at least one voice in the mixed sound, where the preset model is obtained by training multiple groups of voice data through machine learning, and each group of data in the multiple groups of voice data includes: the mixed sound and a label marking the voice in the mixed sound;
Step S1022b, extracting the determined at least one voice from the mixed sound.
Specifically, the at least one voice in the mixed sound may come from different users, and the speech features of different users' voices differ. Therefore, after the mixed sound in the noisy environment is obtained, the mixed sound is analyzed using the preset model trained in advance through machine learning, and the speech features of the voice in the mixed sound are determined, for example, the number of different frequencies in the mixed sound, the amplitude type of the sound, and so on. In order to extract the voice from the sound, a neural network model can be established: the speech features of multiple groups of mixed sound are obtained in advance, corresponding labels are set for the speech features in each group of mixed sound by manual annotation, and the mixed speech is then trained with the set labels to obtain the preset model. After the preset model is built, the mixed sound can be used as the input of the preset model, and the output of the preset model is the at least one voice extracted from the mixed sound.
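As a minimal sketch of this training setup, and only under the assumption of a toy fully-connected network over fixed-size feature frames (the patent discloses neither an architecture, a loss, nor a feature layout), the preset model could be trained roughly as follows:

```python
# Illustrative sketch only: a tiny neural "preset model" trained on labeled
# mixed-sound features, in the spirit of steps S1022a-S1022b. Real systems
# would use spectrogram masks; the architecture and data here are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in training pairs: mixed-sound feature frames and speech-only labels.
mixed = torch.randn(256, 64)         # stand-in for mixed-sound features
speech_label = torch.randn(256, 64)  # stand-in for manually annotated speech

for _ in range(100):                 # toy training loop
    optimizer.zero_grad()
    loss = loss_fn(model(mixed), speech_label)
    loss.backward()
    optimizer.step()

# After training, the model maps mixed features to estimated speech features.
estimated_speech = model(mixed)
```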
It should be noted that the at least one voice is classified by building a classification model using a machine learning method, with the following specific steps:
Step S1024a, obtaining the at least one voice;
Step S1024b, analyzing the at least one voice using a classification model to determine the voice type of each of the at least one voice, where the classification model is obtained by training multiple groups of voice data through machine learning, and each group of voice data in the multiple groups of voice data includes: the at least one voice and a label of the voice type of the at least one voice;
Step S1024c, classifying the at least one voice according to the voice type of each of the at least one voice to obtain the classification result.
Specifically, after the at least one voice is obtained, it is analyzed using the classification model trained in advance through machine learning, and the voice type of each of the at least one voice is determined. The type to which a voice belongs can be determined from its speech features; for example, the speech features of a child's voice differ from those of an adult's voice, so the speech features can determine whether a voice belongs to an adult or a child. In order to determine the voice type from the speech features, a neural network model can be established: the speech features of multiple groups of voices are obtained in advance, corresponding labels are set for the voice types in each group of voices by manual annotation, and the voices are then trained with the set labels to obtain the classification model. After the classification model is built, the multiple voices can be used as the input of the classification model, and the output of the classification model is the voice type corresponding to each voice.
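A minimal sketch of such a classification model, assuming pre-computed voiceprint feature vectors and synthetic labels (the patent specifies neither the classifier type nor the feature layout; an SVM stands in for the neural network here purely for brevity):

```python
# Illustrative sketch of the classification model in steps S1024a-S1024c:
# a classifier trained on labeled voiceprint feature vectors that predicts
# a voice type (e.g. "adult" vs "child"). Features and labels are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 20))       # stand-in voiceprint vectors
labels = np.array(["adult"] * 20 + ["child"] * 20)

classifier = SVC().fit(features, labels)
print(classifier.predict(features[:2]))    # predicted voice types
```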
It should be noted that after the voice to be recognized is determined, effective sound segments still need to be extracted from the plurality of sound segments contained in the voice to be recognized, and the smart device is controlled by the command words in the effective sound segments. Determining effective sound segments from the sound segments contained in the voice to be recognized specifically includes the following steps:
Step S1040, determining the plurality of sound segments in the voice to be recognized;
Step S1042, dividing the plurality of sound segments into at least one independent sound segment according to duration parameters of the plurality of sound segments contained in the voice to be recognized;
Step S1044, determining the effective sound segments according to the duration of the at least one independent sound segment.
In an optional embodiment, a user pauses between sentences while speaking, so the voice to be recognized can be divided into multiple sound segments according to the pause times. For example, suppose the user's speech lasts 10 seconds: the user does not pause during the first 2 seconds, the 3rd second is a pause, seconds 4 to 7 are speech, the 8th second is a pause, and seconds 9 to 10 are speech. The voice to be recognized can thus be divided into three sound segments, i.e., the sound segments corresponding to seconds 1-2, 4-7, and 9-10.
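A minimal sketch of this pause-based division, assuming a per-second voice-activity flag is already available (how activity is detected is not specified here, and the function name is illustrative):

```python
# Sketch: split a speech signal into sound segments at pauses, assuming a
# per-second voice-activity flag (True = speech) is already available.

def split_on_pauses(active):
    """Return (start, end) second indices, 1-based inclusive, of sound segments."""
    segments, start = [], None
    for i, is_speech in enumerate(active, start=1):
        if is_speech and start is None:
            start = i                        # a sound segment begins
        elif not is_speech and start is not None:
            segments.append((start, i - 1))  # a pause ends the segment
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

# The 10-second example from the description: pauses in seconds 3 and 8.
activity = [True, True, False, True, True, True, True, False, True, True]
print(split_on_pauses(activity))  # [(1, 2), (4, 7), (9, 10)]
```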
In an optional embodiment, after the plurality of sound segments in the voice to be recognized are determined, the plurality of sound segments are divided into at least one independent sound segment according to the duration parameters of the sound segments, where the duration parameters include at least one of the following: the start time of a sound segment and the end time of a sound segment. The specific method of dividing independent sound segments is as follows:
Step S1042a, determining the end time of a first sound segment and the start time of a second sound segment, where the first sound segment is the adjacent sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment;
Step S1042b, calculating a first time difference between the end time of the first sound segment and the start time of the second sound segment;
Step S1042c, determining that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is less than a first preset threshold;
Step S1042d, determining that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to the first preset threshold.
This is now illustrated with Fig. 3, which shows a schematic diagram of optionally dividing independent sound segments. Fig. 3 contains 6 sound segments, namely sound segments 1 to 6, where the start and end times of sound segment 1 are t1 and t2 respectively, those of sound segment 2 are t3 and t4, those of sound segment 3 are t5 and t6, those of sound segment 4 are t7 and t8, those of sound segment 5 are t9 and t10, and those of sound segment 6 are t11 and t12. Suppose the first preset threshold is δ. Since t3-t2 < δ and t5-t4 < δ, sound segments 1, 2, and 3 are grouped into independent sound segment 1; since t7-t6 > δ and t9-t8 > δ, sound segment 4 alone forms an independent sound segment, i.e., independent sound segment 2; and since t9-t8 > δ and t11-t10 < δ, sound segments 5 and 6 are grouped into independent sound segment 3.
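The grouping rule of steps S1042a to S1042d can be sketched as follows; the concrete timestamps and δ = 1.0 are assumptions chosen only so that the inequalities of the Fig. 3 example hold:

```python
# Sketch: group adjacent sound segments into one independent sound segment
# when the gap between one segment's end and the next segment's start is
# below the first preset threshold delta (steps S1042a-S1042d).

def group_segments(segments, delta):
    """segments: list of (start, end) tuples sorted by time."""
    groups = [[segments[0]]]
    for prev, cur in zip(segments, segments[1:]):
        if cur[0] - prev[1] < delta:   # first time difference < delta
            groups[-1].append(cur)     # same independent sound segment
        else:
            groups.append([cur])       # new independent sound segment
    return groups

# Values assumed so as to reproduce the Fig. 3 example (delta = 1.0 assumed):
segs = [(0.0, 1.0), (1.5, 2.5), (3.0, 4.0),  # segments 1-3: gaps < delta
        (6.0, 7.0),                          # segment 4: isolated
        (9.0, 10.0), (10.5, 11.5)]           # segments 5-6: gap < delta
print(group_segments(segs, delta=1.0))
# [[(0.0, 1.0), (1.5, 2.5), (3.0, 4.0)], [(6.0, 7.0)], [(9.0, 10.0), (10.5, 11.5)]]
```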
After the independent sound segments are determined, the effective sound segments are determined from the divided independent sound segments. Determining the effective sound segments according to the duration of the at least one independent sound segment specifically includes the following steps:
Step S1044a, determining the duration of each of the at least one independent sound segment;
Step S1044b, determining that an independent sound segment is an effective sound segment when the duration of the independent sound segment is less than a second preset threshold.
This is now illustrated with Fig. 4, which shows a schematic diagram of optionally determining effective sound segments. Fig. 4 contains 3 independent sound segments, namely independent sound segments 1, 2, and 3, where the start and end times of independent sound segment 1 are T1 and T2, so its duration is λ1 = T2 - T1; the start and end times of independent sound segment 2 are T3 and T4, so its duration is λ2 = T4 - T3; and the start and end times of independent sound segment 3 are T5 and T6, so its duration is λ3 = T6 - T5. Suppose the second preset threshold is λ. Since λ1 > λ and λ3 > λ while λ2 < λ, independent sound segment 2 is taken as an effective sound segment, and independent sound segments 1 and 3 are non-effective sound segments.
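A corresponding sketch of steps S1044a to S1044b, with durations and λ = 3.0 assumed so as to reproduce the Fig. 4 outcome (only independent sound segment 2 is shorter than λ):

```python
# Sketch: an independent sound segment is "effective" when its duration is
# below the second preset threshold (steps S1044a-S1044b); short utterances
# are treated as commands, long ones as ordinary conversation.

def effective_segments(independent, lam):
    """independent: list of (start, end) independent sound segments."""
    return [(s, e) for (s, e) in independent if (e - s) < lam]

# Values assumed so as to reproduce the Fig. 4 example (lam = 3.0 assumed):
independent = [(0.0, 4.0), (6.0, 8.0), (9.0, 14.0)]
print(effective_segments(independent, lam=3.0))  # [(6.0, 8.0)]
```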
It should be noted that after the effective sound segments in the voice to be recognized are determined, the command word contained in the effective sound segments is extracted, and the smart device is controlled to perform the corresponding operation according to the command word. Obtaining the recognition result of the voice to be recognized according to the command word extracted from the effective sound segments specifically includes the following steps:
Step S1060, obtaining the command word in the effective sound segments;
Step S1062, determining, in a preset speech library, the control instruction corresponding to the command word;
Step S1064, controlling the device to perform the operation corresponding to the control instruction.
Specifically, after the effective sound segments are obtained, the processor of the smart device analyzes them using an extraction model and determines the command word in the effective sound segments, where the extraction model is obtained by training multiple groups of effective sound segments through machine learning, and each group of effective sound segments includes: the effective sound segments and a label marking the command word in the effective sound segments. After the command word is extracted from the effective sound segments, it is matched against the keywords in the preset speech library. If a keyword matching the command word exists in the preset speech library, the control instruction is determined according to the matched keyword, and the smart device is controlled to perform the corresponding operation according to the control instruction; if no keyword matching the command word is found in the preset speech library, the smart device performs no operation.
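A minimal sketch of steps S1060 to S1064; the library contents, instruction codes, and function names are all illustrative assumptions rather than the patent's actual vocabulary:

```python
# Sketch of steps S1060-S1064: match an extracted command word against a
# preset keyword library and issue the mapped control instruction.

PRESET_LIBRARY = {
    "turn on the air conditioner": "AC_POWER_ON",
    "turn off the air conditioner": "AC_POWER_OFF",
}

def handle_command_word(word, send_instruction):
    instruction = PRESET_LIBRARY.get(word)
    if instruction is None:
        return                 # no matching keyword: perform no operation
    send_instruction(instruction)

handle_command_word("turn on the air conditioner", print)  # -> AC_POWER_ON
handle_command_word("close the window", print)             # no operation
```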
Embodiment 2
According to an embodiment of the present invention, an apparatus embodiment for recognizing voice is further provided. Fig. 5 is a schematic structural diagram of the apparatus for recognizing voice according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes: a first determining module 501, a second determining module 503, and an acquisition module 505.
The first determining module 501 is configured to determine a voice to be recognized; the second determining module 503 is configured to determine effective sound segments from the plurality of sound segments contained in the voice to be recognized; and the acquisition module 505 is configured to obtain a recognition result of the voice to be recognized according to the command word extracted from the effective sound segments.
It should be noted that the first determining module 501, the second determining module 503, and the acquisition module 505 correspond to steps S102 to S106 in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the first determining module includes: a first acquisition module, an extraction module, a first processing module, and a third determining module. The first acquisition module is configured to obtain mixed sound, where the mixed sound includes at least speech and non-speech; the extraction module is configured to extract at least one voice from the mixed sound using a voiceprint recognition algorithm; the first processing module is configured to classify the at least one voice based on the voiceprint recognition algorithm to obtain a classification result; and the third determining module is configured to determine the voice to be recognized from the classification result.
It should be noted that the first acquisition module, the extraction module, the first processing module, and the third determining module correspond to steps S1020 to S1026 in Embodiment 1; the four modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the extraction module includes: a fourth determining module and a fifth determining module. The fourth determining module is configured to analyze the mixed sound using a preset model to determine at least one voice in the mixed sound, where the preset model is obtained by training multiple groups of voice data through machine learning, and each group of data in the multiple groups of voice data includes: the mixed sound and a label marking the voice in the mixed sound; the fifth determining module is configured to extract the determined at least one voice from the mixed sound.
It should be noted that the fourth determining module and the fifth determining module correspond to steps S1022a to S1022b in Embodiment 1; the two modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the processing module includes: a second acquisition module, a sixth determining module, and a second processing module. The second acquisition module is configured to obtain the at least one voice; the sixth determining module is configured to analyze the at least one voice using a classification model to determine the voice type of each of the at least one voice, where the classification model is obtained by training multiple groups of voice data through machine learning, and each group of voice data in the multiple groups of voice data includes: the at least one voice and a label of the voice type of the at least one voice; the second processing module is configured to classify the at least one voice according to the voice type of each of the at least one voice to obtain the classification result.
It should be noted that the second acquisition module, the sixth determining module, and the second processing module correspond to steps S1024a to S1024c in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the second determining module includes: a seventh determining module, a division module, and an eighth determining module. The seventh determining module is configured to determine the plurality of sound segments in the voice to be recognized; the division module is configured to divide the plurality of sound segments into at least one independent sound segment according to the duration parameters of the plurality of sound segments contained in the voice to be recognized; the eighth determining module is configured to determine the effective sound segments according to the duration of the at least one independent sound segment.
It should be noted that the seventh determining module, the division module, and the eighth determining module correspond to steps S1040 to S1044 in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the duration parameters include at least one of the following: the start time of a sound segment and the end time of a sound segment, and the division module includes: a ninth determining module, a calculation module, a tenth determining module, and an eleventh determining module. The ninth determining module is configured to determine the end time of a first sound segment and the start time of a second sound segment, where the first sound segment is the adjacent sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment; the calculation module is configured to calculate a first time difference between the end time of the first sound segment and the start time of the second sound segment; the tenth determining module is configured to determine that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is less than a first preset threshold; the eleventh determining module is configured to determine that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to the first preset threshold.
It should be noted that the ninth determining module, the calculation module, the tenth determining module, and the eleventh determining module correspond to steps S1042a to S1042d in Embodiment 1; the four modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the eighth determining module includes: a twelfth determining module and a thirteenth determining module. The twelfth determining module is configured to determine the duration of each of the at least one independent sound segment; the thirteenth determining module is configured to determine that an independent sound segment is an effective sound segment when the duration of the independent sound segment is less than a second preset threshold.
It should be noted that the twelfth determining module and the thirteenth determining module correspond to steps S1044a to S1044b in Embodiment 1; the two modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
In an optional embodiment, the acquisition module includes: a third acquisition module, a fourteenth determining module, and a control module. The third acquisition module is configured to obtain the command word in the effective sound segments; the fourteenth determining module is configured to determine, in the preset speech library, the control instruction corresponding to the command word; the control module is configured to control the device to perform the operation corresponding to the control instruction.
It should be noted that the third acquisition module, the fourteenth determining module, and the control module correspond to steps S1060 to S1064 in Embodiment 1; the three modules realize the same examples and application scenarios as the corresponding steps, but are not limited to the content disclosed in Embodiment 1 above.
Embodiment 3
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, and the program performs the method for recognizing voice in Embodiment 1 above.
Embodiment 4
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, and the program, when run, performs the method for recognizing voice in Embodiment 1 above.
The above serial numbers of the embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units may be a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.
Claims (11)
- 1. A method for recognizing voice, characterized by comprising: determining a voice to be recognized; determining effective sound segments from a plurality of sound segments contained in the voice to be recognized; and obtaining a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
- 2. The method according to claim 1, characterized in that determining the voice to be recognized comprises: obtaining mixed sound, wherein the mixed sound includes at least speech and non-speech; extracting at least one voice from the mixed sound using a voiceprint recognition algorithm; classifying the at least one voice based on the voiceprint recognition algorithm to obtain a classification result; and determining the voice to be recognized from the classification result.
- 3. The method according to claim 2, characterized in that extracting at least one voice from the mixed sound using the voiceprint recognition algorithm comprises: analyzing the mixed sound using a preset model to determine at least one voice in the mixed sound, wherein the preset model is obtained by training multiple groups of voice data through machine learning, and each group of data in the multiple groups of voice data includes: the mixed sound and a label marking the voice in the mixed sound; and extracting the determined at least one voice from the mixed sound.
- 4. The method according to claim 2, characterized in that classifying the at least one voice based on the voiceprint recognition algorithm to obtain the classification result comprises: obtaining the at least one voice; analyzing the at least one voice using a classification model to determine the voice type of each of the at least one voice, wherein the classification model is obtained by training multiple groups of voice data through machine learning, and each group of voice data in the multiple groups of voice data includes: the at least one voice and a label of the voice type of the at least one voice; and classifying the at least one voice according to the voice type of each of the at least one voice to obtain the classification result.
- 5. The method according to claim 1, characterized in that determining effective sound segments from the sound segments contained in the voice to be recognized comprises: determining a plurality of sound segments in the voice to be recognized; dividing the plurality of sound segments into at least one independent sound segment according to duration parameters of the plurality of sound segments contained in the voice to be recognized; and determining the effective sound segments according to the duration of the at least one independent sound segment.
- 6. The method according to claim 5, characterized in that the duration parameters include at least one of the following: a start time of a sound segment and an end time of a sound segment, wherein dividing the plurality of sound segments into at least one independent sound segment according to the duration parameters of the plurality of sound segments contained in the voice to be recognized comprises: determining the end time of a first sound segment and the start time of a second sound segment, wherein the first sound segment is the adjacent sound segment immediately preceding the second sound segment, and the end time of the first sound segment is earlier than the start time of the second sound segment; calculating a first time difference between the end time of the first sound segment and the start time of the second sound segment; determining that the first sound segment and the second sound segment belong to the same independent sound segment when the first time difference is less than a first preset threshold; and determining that the first sound segment and the second sound segment belong to different independent sound segments when the first time difference is greater than or equal to the first preset threshold.
- 7. The method according to claim 5, characterized in that determining the effective sound segments according to the duration of the at least one independent sound segment comprises: determining the duration of each of the at least one independent sound segment; and determining that an independent sound segment is an effective sound segment when the duration of the independent sound segment is less than a second preset threshold.
- 8. The method according to claim 1, characterized in that obtaining the recognition result of the voice to be recognized according to the command word extracted from the effective sound segments comprises: obtaining the command word in the effective sound segments; determining, in a preset speech library, the control instruction corresponding to the command word; and controlling a device to perform the operation corresponding to the control instruction.
- 9. An apparatus for recognizing voice, characterized by comprising: a first determining module, configured to determine a voice to be recognized; a second determining module, configured to determine effective sound segments from a plurality of sound segments contained in the voice to be recognized; and an acquisition module, configured to obtain a recognition result of the voice to be recognized according to a command word extracted from the effective sound segments.
- 10. A storage medium, characterized in that the storage medium includes a stored program, wherein the program performs the method for recognizing voice according to any one of claims 1 to 8.
- 11. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the method for recognizing voice according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711133270.XA CN108172219B (en) | 2017-11-14 | 2017-11-14 | Method and device for recognizing voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711133270.XA CN108172219B (en) | 2017-11-14 | 2017-11-14 | Method and device for recognizing voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108172219A true CN108172219A (en) | 2018-06-15 |
CN108172219B CN108172219B (en) | 2021-02-26 |
Family
ID=62527360
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711133270.XA Active CN108172219B (en) | 2017-11-14 | 2017-11-14 | Method and device for recognizing voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108172219B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145124A (en) * | 2018-08-16 | 2019-01-04 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
CN110310657A (en) * | 2019-07-10 | 2019-10-08 | 北京猎户星空科技有限公司 | A kind of audio data processing method and device |
WO2020042603A1 (en) * | 2018-08-27 | 2020-03-05 | 广东美的制冷设备有限公司 | Voice control method, module, household appliance, system and computer storage medium |
RU2735363C1 (en) * | 2019-08-16 | 2020-10-30 | Бейджин Сяоми Мобайл Софтвеа Ко., Лтд. | Method and device for sound processing and data medium |
CN111951786A (en) * | 2019-05-16 | 2020-11-17 | 武汉Tcl集团工业研究院有限公司 | Training method and device of voice recognition model, terminal equipment and medium |
WO2021031811A1 (en) * | 2019-08-21 | 2021-02-25 | 华为技术有限公司 | Method and device for voice enhancement |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1017041B1 (en) * | 1995-11-15 | 2005-05-11 | Hitachi, Ltd. | Voice recognizing and translating system |
CN102655002A (en) * | 2011-03-01 | 2012-09-05 | 株式会社理光 | Audio processing method and audio processing equipment |
US8412455B2 (en) * | 2010-08-13 | 2013-04-02 | Ambit Microsystems (Shanghai) Ltd. | Voice-controlled navigation device and method |
CN103489454A (en) * | 2013-09-22 | 2014-01-01 | 浙江大学 | Voice endpoint detection method based on waveform morphological characteristic clustering |
CN104021785A (en) * | 2014-05-28 | 2014-09-03 | 华南理工大学 | Method of extracting speech of most important guest in meeting |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
US20140350918A1 (en) * | 2013-05-24 | 2014-11-27 | Tencent Technology (Shenzhen) Co., Ltd. | Method and system for adding punctuation to voice files |
CN104464723A (en) * | 2014-12-16 | 2015-03-25 | 科大讯飞股份有限公司 | Voice interaction method and system |
CN104536978A (en) * | 2014-12-05 | 2015-04-22 | 奇瑞汽车股份有限公司 | Voice data identifying method and device |
CN105280183A (en) * | 2015-09-10 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Voice interaction method and system |
US9324319B2 (en) * | 2013-05-21 | 2016-04-26 | Speech Morphing Systems, Inc. | Method and apparatus for exemplary segment classification |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN106601233A (en) * | 2016-12-22 | 2017-04-26 | 北京元心科技有限公司 | Voice command recognition method and device and electronic equipment |
CN106601241A (en) * | 2016-12-26 | 2017-04-26 | 河南思维信息技术有限公司 | Automatic time correcting method for recording file |
CN106683680A (en) * | 2017-03-10 | 2017-05-17 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device and computer equipment and computer readable media |
CN106782506A (en) * | 2016-11-23 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method that recorded audio is divided into section |
-
2017
- 2017-11-14 CN CN201711133270.XA patent/CN108172219B/en active Active
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1017041B1 (en) * | 1995-11-15 | 2005-05-11 | Hitachi, Ltd. | Voice recognizing and translating system |
US8412455B2 (en) * | 2010-08-13 | 2013-04-02 | Ambit Microsystems (Shanghai) Ltd. | Voice-controlled navigation device and method |
CN102655002A (en) * | 2011-03-01 | 2012-09-05 | 株式会社理光 | Audio processing method and audio processing equipment |
US9324319B2 (en) * | 2013-05-21 | 2016-04-26 | Speech Morphing Systems, Inc. | Method and apparatus for exemplary segment classification |
US20140350918A1 (en) * | 2013-05-24 | 2014-11-27 | Tencent Technology (Shenzhen) Co., Ltd. | Method and system for adding punctuation to voice files |
CN103489454A (en) * | 2013-09-22 | 2014-01-01 | 浙江大学 | Voice endpoint detection method based on waveform morphological characteristic clustering |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN104021785A (en) * | 2014-05-28 | 2014-09-03 | 华南理工大学 | Method of extracting speech of most important guest in meeting |
CN104536978A (en) * | 2014-12-05 | 2015-04-22 | 奇瑞汽车股份有限公司 | Voice data identifying method and device |
CN104464723A (en) * | 2014-12-16 | 2015-03-25 | 科大讯飞股份有限公司 | Voice interaction method and system |
CN105280183A (en) * | 2015-09-10 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Voice interaction method and system |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN106782506A (en) * | 2016-11-23 | 2017-05-31 | 语联网(武汉)信息技术有限公司 | A kind of method that recorded audio is divided into section |
CN106601233A (en) * | 2016-12-22 | 2017-04-26 | 北京元心科技有限公司 | Voice command recognition method and device and electronic equipment |
CN106601241A (en) * | 2016-12-26 | 2017-04-26 | 河南思维信息技术有限公司 | Automatic time correcting method for recording file |
CN106683680A (en) * | 2017-03-10 | 2017-05-17 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device and computer equipment and computer readable media |
Non-Patent Citations (1)
Title |
---|
Lü Xibao: "Research on Voice Control Technology for Mobile Robots", China Master's Theses Full-text Database *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145124A (en) * | 2018-08-16 | 2019-01-04 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
CN109145124B (en) * | 2018-08-16 | 2022-02-25 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
WO2020042603A1 (en) * | 2018-08-27 | 2020-03-05 | 广东美的制冷设备有限公司 | Voice control method, module, household appliance, system and computer storage medium |
CN111951786A (en) * | 2019-05-16 | 2020-11-17 | 武汉Tcl集团工业研究院有限公司 | Training method and device of voice recognition model, terminal equipment and medium |
CN110310657A (en) * | 2019-07-10 | 2019-10-08 | 北京猎户星空科技有限公司 | A kind of audio data processing method and device |
CN110310657B (en) * | 2019-07-10 | 2022-02-08 | 北京猎户星空科技有限公司 | Audio data processing method and device |
RU2735363C1 (en) * | 2019-08-16 | 2020-10-30 | Бейджин Сяоми Мобайл Софтвеа Ко., Лтд. | Method and device for sound processing and data medium |
US11264027B2 (en) | 2019-08-16 | 2022-03-01 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and apparatus for determining target audio data during application waking-up |
WO2021031811A1 (en) * | 2019-08-21 | 2021-02-25 | 华为技术有限公司 | Method and device for voice enhancement |
CN112420063A (en) * | 2019-08-21 | 2021-02-26 | 华为技术有限公司 | Voice enhancement method and device |
CN112420063B (en) * | 2019-08-21 | 2024-10-18 | 华为技术有限公司 | Voice enhancement method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108172219B (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108172219A (en) | method and device for recognizing voice | |
WO2021012734A1 (en) | Audio separation method and apparatus, electronic device and computer-readable storage medium | |
CN110021308B (en) | Speech emotion recognition method and device, computer equipment and storage medium | |
CN109523986B (en) | Speech synthesis method, apparatus, device and storage medium | |
CN103700370B (en) | A kind of radio and television speech recognition system method and system | |
CN110334201A (en) | A kind of intension recognizing method, apparatus and system | |
CN110136749A (en) | The relevant end-to-end speech end-point detecting method of speaker and device | |
DE112021001064T5 (en) | Device-directed utterance recognition | |
CN108305643A (en) | The determination method and apparatus of emotion information | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
CN110379441B (en) | Voice service method and system based on countermeasure type artificial intelligence network | |
CN111524527A (en) | Speaker separation method, device, electronic equipment and storage medium | |
CN110148399A (en) | A kind of control method of smart machine, device, equipment and medium | |
CN111901627B (en) | Video processing method and device, storage medium and electronic equipment | |
CN113223560A (en) | Emotion recognition method, device, equipment and storage medium | |
CN109376363A (en) | A kind of real-time voice interpretation method and device based on earphone | |
CN109065051A (en) | Voice recognition processing method and device | |
CN106531195B (en) | A kind of dialogue collision detection method and device | |
CN107274900A (en) | Information processing method and its system for control terminal | |
Alghifari et al. | On the use of voice activity detection in speech emotion recognition | |
CN109830240A (en) | Method, apparatus and system based on voice operating instruction identification user's specific identity | |
CN106205610B (en) | A kind of voice information identification method and equipment | |
CN111128240B (en) | Voice emotion recognition method based on anti-semantic-erasure | |
CN112667787A (en) | Intelligent response method, system and storage medium based on phonetics label | |
CN111462762A (en) | Speaker vector regularization method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||