CN107301862A

CN107301862A - Speech recognition method, recognition model establishing method, apparatus, and electronic device

Info

Publication number: CN107301862A
Application number: CN201610203791.7A
Authority: CN
Inventors: 马腾; 李良
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2016-04-01
Filing date: 2016-04-01
Publication date: 2017-10-27

Abstract

The present invention relates to the field of artificial intelligence and discloses a speech recognition method, a recognition model establishing method, an apparatus, and an electronic device, to solve the technical problem in the prior art that speech data with insufficient speech intelligibility cannot be recognized. The method includes: obtaining speech data produced by a user, the user being a user whose speech intelligibility is below a predetermined intelligibility; extracting a first speech feature of the speech data; and determining, based on a speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech intelligibility is below the predetermined intelligibility belong. This achieves the technical effect that speech data produced by a user whose speech intelligibility is below the predetermined intelligibility can be effectively recognized.

Description

Speech recognition method, recognition model establishing method, apparatus, and electronic device

Technical field

The present invention relates to the field of artificial intelligence, and more particularly to a speech recognition method, a recognition model establishing method, an apparatus, and an electronic device.

Background

With the continuous development of science and technology, electronic technology has also developed rapidly, the variety of electronic products keeps growing, and people enjoy the conveniences that this development brings. Through various types of electronic devices, people can now enjoy the comfortable life that technological development brings. For example, electronic devices such as smartphones and tablet computers have become an important part of people's lives.

Many electronic devices have a speech recognition function that converts the speech data produced by a user into text data, saving the user the time needed to enter text. However, when an electronic device recognizes speech data, the speaker must articulate fairly clearly; otherwise the speech cannot be recognized. Baby language is one example. Baby language refers to the soft, childish speech of young children, which is endearing and pleasant to hear but difficult to make out. That is, the prior art has the technical problem that speech data with insufficient speech intelligibility cannot be recognized.

Summary of the invention

The present invention provides a speech recognition method, a recognition model establishing method, an apparatus, and an electronic device, to solve the technical problem in the prior art that speech data with insufficient speech intelligibility cannot be recognized.

In a first aspect, an embodiment of the present invention provides a speech recognition method, including:

obtaining speech data produced by a user, the user being a user whose speech intelligibility is below a predetermined intelligibility;

extracting a first speech feature of the speech data;

determining, based on a speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech intelligibility is below the predetermined intelligibility belong.

Optionally, the users whose speech intelligibility is below the predetermined intelligibility include at least one of: users who articulate words unclearly, users whose speech leaks air, users who lisp, and users who cannot form words and can only make sounds.

Optionally, the speech recognition model of the preset user group is established in the following manner:

for each preset user group, determining at least one sample user;

collecting speech samples of the at least one sample user, where each speech sample is labeled with its semantics;

determining, based on the speech samples belonging to each semantics, the speech feature corresponding to that semantics, and associating each speech feature with its semantics to obtain the speech recognition model.

Optionally, the preset user group includes M user groups, M being a positive integer, and the speech recognition model of the preset user group includes recognition models for the M user groups, the recognition model of each user group containing the correspondence between speech features and semantics for that user group; or

the preset user group includes M user groups, and the speech recognition model of the preset user group contains the correspondence between each semantics and the speech features of every user group among the M user groups.

Optionally, after the first semantics represented by the first speech feature is determined based on the speech recognition model of the preset user group, the method further includes:

providing the first semantics of the first speech feature to a preset user, so that the preset user can judge whether the first semantics of the first speech feature is accurate;

obtaining a second semantics that the preset user provides for the first speech feature upon finding the first semantics inaccurate;

in the speech recognition model, replacing the first semantics of the first speech feature with the second semantics.

Optionally, determining the first semantics represented by the first speech feature based on the speech recognition model of the preset user group includes:

dividing the speech feature into at least one speech feature segment;

matching each of the at least one speech feature segment against the speech features in the speech recognition model, thereby recognizing the semantics of each speech feature segment;

combining the semantics of each of the at least one speech feature segment to obtain the first semantics represented by the speech feature.

Optionally, the speech feature includes a frequency range and/or a frequency shape.

Optionally, after the first semantics represented by the first speech feature is recognized based on the speech recognition model of the preset user group, the method further includes:

sending the first semantics to the electronic device of a preset user; or

judging whether the first semantics contains a predetermined keyword, and, when it does, sending the first semantics to the electronic device of the preset user.

In a second aspect, an embodiment of the present invention provides a speech recognition model establishing method, including:

for a preset user group, determining at least one sample user, the preset user group being the group to which users whose speech intelligibility is below a predetermined intelligibility belong;

determining, based on the speech samples belonging to each semantics, the speech feature corresponding to that semantics, and associating each speech feature with its semantics to obtain the speech recognition model.

Optionally, after associating each speech feature with its semantics to obtain the speech recognition model, the method further includes:

after speech data produced by a user is obtained, recognizing the speech data with the speech recognition model to obtain the first semantics represented by the speech data, the user being a user whose speech intelligibility is below the predetermined intelligibility;

replacing the first semantics of the first speech feature with a second semantics.

In a third aspect, an embodiment of the present invention provides a speech recognition apparatus, including:

a first obtaining module, configured to obtain speech data produced by a user, the user being a user whose speech intelligibility is below a predetermined intelligibility;

an extraction module, configured to extract a first speech feature of the speech data;

a first determining module, configured to determine, based on a speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech intelligibility is below the predetermined intelligibility belong.

In a fourth aspect, an embodiment of the present invention provides a speech recognition model establishing apparatus, including:

a third determining module, configured to determine at least one sample user for a preset user group, the preset user group being the group to which users whose speech intelligibility is below a predetermined intelligibility belong;

a second obtaining module, configured to collect speech samples of the at least one sample user, where each speech sample is labeled with its semantics;

a third obtaining module, configured to determine, based on the speech samples belonging to each semantics, the speech feature corresponding to that semantics, and to associate each speech feature with its semantics to obtain the speech recognition model.

In a fifth aspect, an embodiment of the present invention provides an electronic device including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

obtaining speech data produced by a user, the user being a user whose speech intelligibility is below a predetermined intelligibility;

extracting a first speech feature of the speech data;

In a sixth aspect, an embodiment of the present invention provides an electronic device including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

The beneficial effects of the present invention are as follows:

In the embodiments of the present invention, after speech data produced by a user whose speech intelligibility is below a predetermined intelligibility is obtained, the first speech feature of the speech data can be extracted, and the first semantics represented by the first speech feature is then determined based on the speech recognition model of the preset user group, the preset user group being the group to which users whose speech intelligibility is below the predetermined intelligibility belong. The speech data produced by the user can thus be recognized through the speech recognition model of the preset user group, so that speech data produced by a user whose speech intelligibility is below the predetermined intelligibility is effectively recognized.

Brief description of the drawings

Fig. 1 is a flowchart of the speech recognition method in an embodiment of the present invention;

Fig. 2 is a flowchart of establishing the speech recognition model in the speech recognition method of an embodiment of the present invention;

Fig. 3 is a flowchart of determining the first semantics in the speech recognition method of an embodiment of the present invention;

Fig. 4 is a flowchart of the speech recognition model establishing method in an embodiment of the present invention;

Fig. 5 is a structural diagram of the speech recognition apparatus in an embodiment of the present invention;

Fig. 6 is a structural diagram of the speech recognition model establishing apparatus in an embodiment of the present invention;

Fig. 7 is a block diagram, according to an exemplary embodiment, of an electronic device for implementing a speech recognition method or a speech recognition model establishing method;

Fig. 8 is a schematic structural diagram of a server for implementing a speech recognition method or a speech recognition model establishing method in an embodiment of the present invention.

Detailed description

The present invention provides a speech recognition method, a recognition model establishing method, an apparatus, and an electronic device, to solve the technical problem in the prior art that the speech data of users with insufficient speech intelligibility cannot be recognized.

To solve the above technical problem, the general idea of the technical solutions in the embodiments of the present application is as follows:

After speech data produced by a user whose speech intelligibility is below a predetermined intelligibility is obtained, the first speech feature of the speech data can be extracted, and the first semantics represented by the first speech feature is then determined based on the speech recognition model of the preset user group, the preset user group being the group to which users whose speech intelligibility is below the predetermined intelligibility belong. The speech data produced by the user can thus be recognized through the speech recognition model of the preset user group, so that speech data produced by a user whose speech intelligibility is below the predetermined intelligibility is effectively recognized.

For a better understanding of the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments of the present invention and the specific features in them are detailed explanations of the technical solutions of the present invention rather than limitations on them, and that, where no conflict arises, the embodiments and the technical features in them may be combined with one another.

In a first aspect, an embodiment of the present invention provides a speech recognition method. Referring to Fig. 1, the method includes:

Step S101: obtaining speech data produced by a user, the user being a user whose speech intelligibility is below a predetermined intelligibility;

Step S102: extracting a first speech feature of the speech data;

Step S103: determining, based on a speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech intelligibility is below the predetermined intelligibility belong.
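Steps S101 to S103 can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not the patented implementation: the speech feature is assumed to be a frequency range, the input is assumed to be a list of dominant frequencies, and the names GroupModel, extract_feature, and recognize_speech are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical feature type: a (low_hz, high_hz) frequency range.
FrequencyRange = Tuple[float, float]


@dataclass
class GroupModel:
    """Toy model for one preset user group: semantics -> frequency range."""
    ranges: Dict[str, FrequencyRange]

    def recognize(self, feature: FrequencyRange) -> Optional[str]:
        # Return the first semantics whose stored range contains the feature.
        low, high = feature
        for semantics, (range_low, range_high) in self.ranges.items():
            if range_low <= low and high <= range_high:
                return semantics
        return None


def extract_feature(frequencies_hz: List[float]) -> FrequencyRange:
    """Stand-in for step S102: use the min and max dominant frequency."""
    return (min(frequencies_hz), max(frequencies_hz))


def recognize_speech(frequencies_hz: List[float], model: GroupModel) -> Optional[str]:
    feature = extract_feature(frequencies_hz)  # step S102
    return model.recognize(feature)            # step S103


# Step S101 stand-in: speech data already reduced to dominant frequencies.
model = GroupModel({"eat": (450.0, 900.0), "sleep": (1000.0, 1400.0)})
result = recognize_speech([520.0, 610.0, 780.0], model)
```

A real system would extract richer features and use a statistical matcher; the containment test here only keeps the example short.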

For example, the solution is applied to an electronic device that has a speech recognition function. The electronic device may be a client device, such as a mobile phone, tablet computer, notebook computer, wristband, or children's smartwatch; it may also be a server. The embodiments of the present invention impose no limitation.

In step S101, if the solution is applied to a client device, the client device may collect the speech data produced by the user through a built-in or external sound collection device, or may receive speech data sent by another client device. If the solution is applied to a server, the server may receive the speech data sent by a client device connected to it.

For example, user A wears a wearable device (for example, a children's watch), and the mobile phone of user B (for example, user A's parent or relative) has a data connection with the wearable device. The wearable device may collect user A's speech data, recognize on the wearable device the first semantics corresponding to the speech data through the speech recognition model, and then send the first semantics to the mobile phone. In this case, the solution is applied to a client device (the wearable device), which obtains the speech data directly through its audio collection device. Alternatively, after collecting user A's speech data, the wearable device may send it directly to a server; the server recognizes the first semantics corresponding to the speech data through the speech recognition model and then sends it to user B. In this case, the solution is applied to the server, which receives the speech data sent by the wearable device. Alternatively, after collecting user A's speech data, the wearable device may send it to the mobile phone (by short-range wireless transmission or forwarded through a server), and the mobile phone recognizes the first semantics corresponding to the speech data through the speech recognition model. In this case, the solution is applied to a client device (the mobile phone), which receives the speech data sent by another client device (the wearable device).

The predetermined intelligibility is, for example, the level of intelligibility at which a user's words are articulated clearly enough for the people nearby to make them out. Speech intelligibility (measured, for example, by the speech transmission index) is a physical quantity that measures how intelligible a speaker's speech is. (STI, RASTI, and STIPA are speech intelligibility results obtained with different measurement methods.) According to the relevant standards, a speaker utters speech units (sentences, words, or syllables), these pass through a speech transmission system, and the proportion that listeners recognize correctly is examined; the result is the speech intelligibility.
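The listener-based measure described above (the proportion of speech units recognized correctly) can be illustrated with a short sketch. It is not an STI, RASTI, or STIPA computation; the function names and the 0.6 threshold are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def intelligibility_score(spoken: Sequence[str], heard: Sequence[str]) -> float:
    """Fraction of speech units that the listener identified correctly."""
    if len(spoken) != len(heard) or not spoken:
        raise ValueError("spoken and heard must be non-empty and equal in length")
    correct = sum(1 for said, got in zip(spoken, heard) if said == got)
    return correct / len(spoken)


def is_below_threshold(score: float, threshold: float = 0.6) -> bool:
    """True when the speaker counts as a low-intelligibility user.

    The default threshold of 0.6 is a hypothetical value, not from the source.
    """
    return score < threshold


# Four syllables were spoken; the listener misheard one of them.
score = intelligibility_score(["ba", "ma", "da", "ga"], ["ba", "ma", "ta", "ga"])
```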

The users whose speech intelligibility is below the predetermined intelligibility may be of many kinds, for example: users who articulate words unclearly, users whose speech leaks air, users who lisp, and users who cannot form words and can only make sounds. Babies speak in a soft, childish voice and often articulate unclearly or can only make sounds without forming words, so babies generally belong to the users whose speech intelligibility is below the predetermined intelligibility. These users usually pronounce inaccurately when producing speech data, so the intelligibility is low and the speech cannot be recognized by an electronic device and may not even be understood by other people.

In step S102, the speech feature of the speech data may include at least one of a frequency range and a frequency shape.

In step S103, referring to Fig. 2, the speech recognition model of the preset user group can be obtained in the following manner:

Step S201: for each preset user group, determining at least one sample user;

Step S202: collecting speech samples of the at least one sample user, where each speech sample is labeled with its semantics;

Step S203: determining, based on the speech samples belonging to each semantics, the speech feature corresponding to that semantics, and associating each speech feature with its semantics to obtain the speech recognition model.

In step S201, the preset user groups may be several different user groups, for example: the group of users who articulate words unclearly, the group of users whose speech leaks air, the group of users who lisp, and the group of users who cannot form words and can only make sounds (for example, babies).

For the same semantics, the speech features produced by different user groups differ. Therefore, when the speech recognition model is established, the speech feature corresponding to a given semantics can be obtained separately for each user group. Specifically, in step S201, at least one sample user is obtained for each preset user group. For example, 100 sample users might be obtained for the group of users who articulate unclearly, 100 for the group whose speech leaks air, 100 for the group of users who lisp, 100 for the group of users who cannot form words and can only make sounds, and so on.

In step S202, for each sample user, after a speech sample of that sample user is obtained, the sample collector may determine the semantics of the speech sample by combining the environment in which the speech data was collected with auxiliary information provided by other users. The other users are usually acquaintances of the sample user (for example, a baby's parents or relatives, or the doctors or researchers who work with the group) or the sample user himself or herself. The speech sample is then labeled with the semantics obtained.

Basic speech samples (for example, those obtained in the initial stage) may be labeled directly by hand as described above. Once a certain base has been built up, the semantics of speech samples may also be labeled automatically by machine; if the semantics of a speech sample cannot be labeled by machine, the sample can be handed back for manual labeling.

In step S203, the speech recognition model is obtained by recognizing the speech samples produced by users of the preset user group and matching them with their corresponding semantics.

In step S203, for each preset user group, at least one speech sample corresponding to each semantics can be obtained. For example, for the group of users who articulate unclearly, the semantics "eat" has speech sample A, speech sample B, and speech sample C. The speech features of speech samples A, B, and C can be extracted separately and then combined to obtain the speech feature of the semantics "eat".

For example, if the speech feature is a frequency range, suppose the frequency range of speech sample A is 500 Hz to 800 Hz, that of speech sample B is 550 Hz to 900 Hz, and that of speech sample C is 450 Hz to 700 Hz. The three frequency ranges can be merged by taking their union to obtain the frequency range of the semantics "eat", namely 450 Hz to 900 Hz. The speech feature of a semantics can of course be determined in other ways, for example by taking the intersection or the average; the embodiments of the present invention do not enumerate them in detail and impose no limitation.
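The frequency-range example above can be worked through in code. The sketch below is illustrative only; it shows the three merging strategies the text mentions (union, intersection, and average) applied to the ranges of speech samples A, B, and C. The function names are hypothetical.

```python
from typing import List, Tuple

FrequencyRange = Tuple[float, float]  # (low_hz, high_hz)


def merge_union(ranges: List[FrequencyRange]) -> FrequencyRange:
    """Smallest single range that covers every sample range."""
    return (min(low for low, _ in ranges), max(high for _, high in ranges))


def merge_intersection(ranges: List[FrequencyRange]) -> FrequencyRange:
    """Range shared by every sample; fails if the samples do not overlap."""
    low = max(low for low, _ in ranges)
    high = min(high for _, high in ranges)
    if low > high:
        raise ValueError("sample ranges have no common overlap")
    return (low, high)


def merge_mean(ranges: List[FrequencyRange]) -> FrequencyRange:
    """Average of the lower bounds and average of the upper bounds."""
    count = len(ranges)
    return (sum(low for low, _ in ranges) / count,
            sum(high for _, high in ranges) / count)


# Speech samples A, B, and C for the semantics "eat".
samples = [(500.0, 800.0), (550.0, 900.0), (450.0, 700.0)]
union_range = merge_union(samples)                # (450.0, 900.0)
intersection_range = merge_intersection(samples)  # (550.0, 700.0)
mean_range = merge_mean(samples)                  # (500.0, 800.0)
```

The union accepts the widest set of utterances and so is the most tolerant; the intersection is the strictest of the three.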

In a specific implementation, several forms of speech recognition model can be determined in step S203. Two of them are described below; a specific implementation is of course not limited to these two cases.

In the first form, the preset user group includes M user groups, M being a positive integer, and the speech recognition model of the preset user group includes recognition models for the M user groups, the recognition model of each user group containing the correspondence between speech features and semantics for that user group.

For example, suppose the preset user group includes four user groups: the group of users who articulate words unclearly, the group of users who lisp, the group of users whose speech leaks air, and the group of users who cannot form words and can only make sounds. A separate speech recognition model can then be established for each of the four user groups. Suppose, for instance, that the semantics "eat" has four speech features: speech feature A (for the group of users who articulate unclearly), speech feature B (for the group whose speech leaks air), speech feature C (for the group of users who lisp), and speech feature D (for the group who cannot form words and can only make sounds). When the speech recognition model is established, the entries can then be divided by category in the form of Table 1:

Table 1

Thus, after speech data is obtained, the user group to which the user belongs is determined first, and the speech is then recognized with the speech recognition model of that user group. The user group to which the user belongs may be set manually by a user of the electronic device, or may be identified by performing feature recognition on the speech data; the embodiments of the present invention impose no limitation.

With this form, the first speech feature only needs to be matched against the speech recognition model of one user group, which improves recognition efficiency.

In the second form, the preset user group includes M user groups, and the speech recognition model of the preset user group contains the correspondence between each semantics and the speech features of every user group among the M user groups.

For example, for the semantics "eat", the corresponding speech features of the group of users who articulate unclearly, the group of users who lisp, the group of users whose speech leaks air, and the group of users who cannot form words and can only make sounds are each associated with it. Then, when the first semantics is matched from the first speech feature, the first speech feature is matched against the speech features of all user groups. For example, the correspondence shown in Table 2 can be established:

Table 2
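The two model forms described above can be contrasted with a small sketch. It assumes, purely for illustration, that a speech feature can be represented by a label such as "A" to "D"; the group names, dictionary layouts, and function names are hypothetical and do not reproduce Table 1 or Table 2.

```python
from typing import Dict, Optional

# First form: one model per user group (feature label -> semantics).
PER_GROUP_MODELS: Dict[str, Dict[str, str]] = {
    "unclear_articulation": {"A": "eat"},
    "air_leak": {"B": "eat"},
    "lisp": {"C": "eat"},
    "sounds_only": {"D": "eat"},
}

# Second form: one combined table (semantics -> feature of every group).
COMBINED_MODEL: Dict[str, Dict[str, str]] = {
    "eat": {
        "unclear_articulation": "A",
        "air_leak": "B",
        "lisp": "C",
        "sounds_only": "D",
    },
}


def recognize_per_group(group: str, feature: str) -> Optional[str]:
    """First form: look only in the model of the user's own group."""
    return PER_GROUP_MODELS.get(group, {}).get(feature)


def recognize_combined(feature: str) -> Optional[str]:
    """Second form: compare against the features of all groups."""
    for semantics, features_by_group in COMBINED_MODEL.items():
        if feature in features_by_group.values():
            return semantics
    return None
```

The first form needs the user's group up front but searches a smaller table; the second form needs no group information at the cost of comparing against every group's features.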

In step S103, determining the first semantics represented by the first speech feature based on the speech recognition model of the preset user group includes the following steps; refer to Fig. 3:

Step S301: dividing the speech feature into at least one speech feature segment;

Step S302: matching each of the at least one speech feature segment against the speech features in the speech recognition model, thereby recognizing the semantics of each speech feature segment;

Step S303: combining the semantics of each of the at least one speech feature segment to obtain the first semantics represented by the speech feature.

In step S301, the speech feature may be divided based on the tone and duration of each word in it, yielding at least one speech feature segment.

In step S302, each speech feature segment may be matched against the speech features in the speech recognition model. If the segment successfully matches the speech feature corresponding to some semantics in the speech recognition model, that semantics is taken as the semantics of the speech feature segment. For example, if a speech feature segment is determined to match speech feature E in the speech recognition model, and speech feature E corresponds to the semantics "sleep", the semantics of that segment is determined to be "sleep".

In step S303, after the semantics of each feature segment is obtained, the semantics are arranged in the chronological order in which the user produced the speech feature segments, yielding the first semantics.
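Steps S302 and S303 can be sketched as follows. The sketch assumes that step S301 has already produced segments, each carrying a start time and a feature label; the Segment type, the toy model, the label "F", and the choice to skip unmatched segments are all illustrative assumptions rather than details from the source.

```python
from typing import Dict, List, Tuple

# Hypothetical segment produced by step S301: (start_time_s, feature_label).
Segment = Tuple[float, str]


def recognize_segments(segments: List[Segment], model: Dict[str, str]) -> str:
    """Match each segment (S302), then join the results in time order (S303)."""
    matched: List[Tuple[float, str]] = []
    for start_time, feature in segments:
        semantics = model.get(feature)  # step S302: look up the segment
        if semantics is not None:       # unmatched segments are skipped here
            matched.append((start_time, semantics))
    matched.sort(key=lambda item: item[0])  # step S303: chronological order
    return " ".join(semantics for _, semantics in matched)


# Toy model: feature E -> "sleep" follows the text; F -> "want" is made up.
MODEL = {"E": "sleep", "F": "want"}
first_semantics = recognize_segments([(1.2, "E"), (0.0, "F"), (2.0, "?")], MODEL)
```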

In a specific implementation, after the first semantics represented by the first speech feature is obtained in step S103, the first semantics may also be output. It can be output in several ways; two are described below, and a specific implementation is of course not limited to these two cases.

In the first way, the first semantics is output by an output device of the current electronic device.

For example, suppose a baby wears a smartwatch (the current electronic device). If the baby's parents are nearby while the baby speaks, then, so that the parents can understand the first semantics of what the baby says, the first semantics can be output by an output device of the smartwatch, for example shown on the display unit of the smartwatch or played through its sound output device.

As another example, the baby wears a smartwatch, and the baby's parents control the smartwatch through a mobile phone (the current electronic device). After collecting the speech data produced by the baby, the smartwatch sends it to the mobile phone. After recognizing the first semantics represented by the speech data, the mobile phone can output the first semantics directly, in a manner similar to that of the smartwatch, which is not repeated here.

In the second way, the first semantics is sent to the electronic device of a preset user and is then output by that electronic device.

For example, the preset user is a user who has a close relationship with the current user (for example, the current user's parents or relatives), a doctor, or a researcher. For example, a baby wears a smartwatch, and the baby's parents control the smartwatch through a mobile phone. After collecting the baby's speech data, the smartwatch can recognize the first semantics corresponding to the baby's speech data but sends it to another electronic device, such as a mobile phone or PC, which outputs the first semantics through its display unit or sound output device. In an optional embodiment, the other electronic device outputs the speech data produced by the user through its sound output device and outputs the first semantics through its display unit, so that the preset user can learn from the speech data together with the first semantics.

When sending the first semantics to the electronic device of the preset user, the current electronic device may send it directly, or may first judge whether the first semantics contains a predetermined keyword and send the first semantics to the electronic device of the preset user only when it does.

The predetermined keywords are, for example, keywords that express a need directed at the preset user. If the current user is a baby, the predetermined keywords are, for example, "eat", "thirsty", "hungry", and "toilet"; if the current user is an adult with insufficient speech intelligibility, the predetermined keywords are, for example, "help", "seeking help", and "toilet". The embodiments of the present invention impose no limitation.
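The keyword check before forwarding can be sketched as below. The sketch assumes the first semantics is available as plain text and that sending is represented by a callback; the names BABY_KEYWORDS, contains_keyword, and forward_if_needed are hypothetical, and whole-word matching is one possible design choice among several.

```python
from typing import Callable, Iterable, List

# Example keywords for a baby, taken from the examples in the text.
BABY_KEYWORDS = ("eat", "thirsty", "hungry", "toilet")


def contains_keyword(first_semantics: str, keywords: Iterable[str]) -> bool:
    """Whole-word match, so that 'eat' does not fire on 'great'."""
    words = set(first_semantics.lower().split())
    return any(keyword in words for keyword in keywords)


def forward_if_needed(first_semantics: str,
                      keywords: Iterable[str],
                      send: Callable[[str], None]) -> bool:
    """Send the first semantics to the preset user's device only on a match."""
    if contains_keyword(first_semantics, keywords):
        send(first_semantics)
        return True
    return False


# A list stands in for the preset user's device in this sketch.
outbox: List[str] = []
forwarded = forward_if_needed("I am hungry", BABY_KEYWORDS, outbox.append)
ignored = forward_if_needed("la la la", BABY_KEYWORDS, outbox.append)
```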

As a kind of optional embodiment, the first semanteme of the first phonetic feature is being identified based on speech recognition modeling Afterwards, be also based on the feedback of pre-set user, the semanteme of the first phonetic feature be modified, its specific makeover process include with Lower step：First semanteme of first phonetic feature is supplied to pre-set user, so that the pre-set user judges institute Whether state the first phonetic feature described first be semantic accurate；Obtain the pre-set user and assert that described first is semantic inaccurate When, the second semanteme provided for first phonetic feature；In the speech recognition modeling, by first phonetic feature It is semantic that first semanteme is replaced with by second semanteme.

As an example, assume that by recognizing the speech feature produced by the user, the corresponding first semantics is determined to be "eat". The first semantics can then be provided to the preset user by the current electronic device, or sent to the electronic device of the preset user and provided to the preset user there. After hearing the speech data produced by the user, the preset user can, if the first semantics seems correct, produce feedback information confirming that the first semantics is correct. If the preset user considers the first semantics wrong, the preset user can produce feedback information that corrects the semantics of the speech data, for example by inputting to the current electronic device the correct semantics corresponding to the speech feature (for example, the second semantics), so that the first semantics is replaced with the second semantics in the speech recognition model and the speech recognition model is thereby corrected.
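The correction step can be sketched as a small update on a feature-to-semantics mapping. The dictionary representation of the model, the tuple representation of a speech feature, and the function name are assumptions made only for illustration; the description does not fix a data structure.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation of a speech feature as a tuple of numbers.
Feature = Tuple[float, ...]


def apply_feedback(
    model: Dict[Feature, str],
    feature: Feature,
    first_semantics: str,
    second_semantics: Optional[str],
) -> str:
    """Update the model from the preset user's feedback.

    second_semantics is None when the preset user confirms the first
    semantics; otherwise it is the corrected meaning the preset user gave.
    Returns the semantics stored for the feature after the feedback.
    """
    if second_semantics is None or second_semantics == first_semantics:
        # The first semantics was judged accurate: nothing to change.
        return first_semantics
    # Replace the first semantics with the second semantics in the model.
    model[feature] = second_semantics
    return second_semantics
```

For instance, calling `apply_feedback(model, feature, "eat", "drink water")` overwrites the stored meaning of `feature`, while passing `None` leaves the model untouched.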

As an optional embodiment, the same short utterance may have different semantics in different scenarios. Therefore, when the speech recognition model is established, the same speech feature may have multiple semantics, each of which is related to characteristic data of the user at the time the speech data is produced, for example the surrounding environment, the user's actions, or the user's tone. In step S103, the text content expressed by the first speech feature can then first be determined from the first speech feature, after which the characteristic data of the user is obtained and the semantics corresponding to the text content is determined from the characteristic data.

For example, in the speech recognition model, the text content "I will not" has the correspondences shown in Table 3:

Table 3

Then, if the text content represented by the speech data is determined from the first speech feature to be "I will not", the user's action can be captured by a camera. If the user action is determined to be "throwing things", the user is considered likely to be angry, so some soothing action needs to be taken toward the user.
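The two-stage lookup (text content first, then characteristic data) can be sketched as below. Only the "throwing things" row is taken from the example in the text; the other row, the table layout, and the function name are hypothetical placeholders and do not reproduce Table 3.

```python
from typing import Dict, Optional, Tuple

# (text content, characteristic data) -> semantics.
# Only the "throwing things" entry comes from the description;
# the "shaking head" entry is a hypothetical placeholder.
CONTEXT_TABLE: Dict[Tuple[str, str], str] = {
    ("I will not", "throwing things"): "angry",
    ("I will not", "shaking head"): "refusal",  # hypothetical
}


def resolve_semantics(
    text_content: str,
    characteristic: str,
    table: Dict[Tuple[str, str], str] = CONTEXT_TABLE,
) -> Optional[str]:
    """Pick the semantics of a text content using the user's characteristic data.

    Returns None when the model holds no entry for this combination.
    """
    return table.get((text_content, characteristic))
```

In this sketch the characteristic data is a single label such as a detected user action; an implementation could equally key on the surrounding environment or the user's tone, which the description also mentions.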

Based on the above scheme, the technical effect of being able to recognize the meaning of the speech data produced by the user is achieved.

In a second aspect, based on the same inventive concept, an embodiment of the present invention provides a speech recognition model establishing method. Referring to Fig. 4, the method includes:

Step S401: for a preset user group, determine at least one sample user, the preset user group being the group to which users whose speech clarity is lower than a predetermined clarity belong;

Step S402: collect speech samples of the at least one sample user, wherein each speech sample is labelled with its semantics;

Step S403: determine the speech feature corresponding to each semantics based on the speech samples included under that semantics, and associate the corresponding speech features with the semantics to obtain the speech recognition model.
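Steps S402 and S403 can be sketched as follows, under stated assumptions: each speech sample is already reduced to a numeric feature vector, the feature associated with a semantics is taken to be the mean of its samples' vectors, and recognition picks the nearest such feature. These choices and all names are illustrative and are not stated in the description.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

# A labelled speech sample: (feature vector, semantics). Hypothetical layout.
Sample = Tuple[Sequence[float], str]


def build_model(samples: List[Sample]) -> Dict[str, Tuple[float, ...]]:
    """Group labelled samples by semantics and associate each semantics
    with one representative feature (here: the mean vector, an assumption)."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for feature, semantics in samples:
        grouped[semantics].append(feature)
    model: Dict[str, Tuple[float, ...]] = {}
    for semantics, features in grouped.items():
        # zip(*features) iterates over the vector dimensions.
        model[semantics] = tuple(sum(dim) / len(features) for dim in zip(*features))
    return model


def recognize(model: Dict[str, Tuple[float, ...]], feature: Sequence[float]) -> str:
    """Return the semantics whose representative feature is closest
    (Euclidean distance) to the given feature."""
    return min(model, key=lambda semantics: dist(model[semantics], feature))
```

A nearest-mean rule is used only because it keeps the sketch short; any classifier that maps speech features to labelled semantics would fit the same three steps.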

Since how the speech recognition model is established has already been described in the first aspect of the embodiment of the present invention, it is not repeated here. Every way of establishing the speech recognition model used in the first aspect of the embodiment of the present invention also applies to the second aspect.

In a third aspect, based on the same inventive concept, an embodiment of the present invention provides a speech recognition device. Referring to Fig. 5, the device includes:

a first obtaining module 50, configured to obtain speech data produced by a user, the user being a user whose speech clarity is lower than a predetermined clarity;

an extraction module 51, configured to extract a first speech feature of the speech data;

a first determining module 52, configured to determine, based on the speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech clarity is lower than the predetermined clarity belong.

Optionally, the device further includes:

a second determining module, configured to determine at least one sample user for each preset user group;

a first collecting module, configured to collect speech samples of the at least one sample user, wherein each speech sample is labelled with its semantics;

a second obtaining module, configured to determine the speech feature corresponding to each semantics based on the speech samples included under that semantics, and to associate the corresponding speech features with the semantics to obtain the speech recognition model.

Optionally, the device further includes:

a first providing module, configured to provide the first semantics of the first speech feature to a preset user, so that the preset user can judge whether the first semantics of the first speech feature is accurate;

a first acquiring module, configured to acquire the second semantics that the preset user provides for the first speech feature when the preset user finds the first semantics inaccurate;

a first replacement module, configured to replace, in the speech recognition model, the first semantics of the first speech feature with the second semantics.

Optionally, the first determining module 52 includes:

a division unit, configured to divide the speech feature into at least one speech feature fragment;

a matching unit, configured to match each of the at least one speech feature fragment against the speech features in the speech recognition model, and thereby identify the semantics of each speech feature fragment;

a synthesis unit, configured to combine the semantics of each of the at least one speech feature fragment to obtain the first semantics represented by the speech feature.
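The division, matching, and synthesis units can be sketched as three small functions. The fixed fragment length, the use of the fragment mean and Euclidean distance for matching, and the joining of fragment semantics into one string are all assumptions chosen to keep the example short; the description does not prescribe them.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Frame = Tuple[float, ...]  # hypothetical per-frame feature vector


def divide(feature: Sequence[Frame], segment_length: int) -> List[List[Frame]]:
    """Division unit: split the speech feature into fragments of
    segment_length frames (the last fragment may be shorter)."""
    return [
        list(feature[i:i + segment_length])
        for i in range(0, len(feature), segment_length)
    ]


def match(fragment: List[Frame], model: Dict[str, Frame]) -> str:
    """Matching unit: return the semantics whose model feature is closest
    to the mean frame of the fragment."""
    mean = tuple(sum(dim) / len(fragment) for dim in zip(*fragment))
    return min(model, key=lambda semantics: dist(model[semantics], mean))


def synthesize(fragment_semantics: List[str]) -> str:
    """Synthesis unit: join fragment semantics in order, collapsing
    consecutive repeats."""
    merged: List[str] = []
    for semantics in fragment_semantics:
        if not merged or merged[-1] != semantics:
            merged.append(semantics)
    return " ".join(merged)


def recognize_by_fragments(
    feature: Sequence[Frame], model: Dict[str, Frame], segment_length: int = 2
) -> str:
    """Run division, matching, and synthesis to get the first semantics."""
    fragments = divide(feature, segment_length)
    return synthesize([match(fragment, model) for fragment in fragments])
```

With a toy model of two entries, a five-frame feature whose first two frames lie near one entry and whose remaining frames lie near the other yields the two meanings in order.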

Optionally, the device further includes:

a sending module, configured to send the first semantics to the electronic device of the preset user; or

The speech recognition device introduced in the third aspect of the embodiment of the present invention is the device used to implement the speech recognition method of the first aspect. Based on the speech recognition method introduced in the first aspect, those skilled in the art can understand the specific structure and variations of the device, so details are not repeated here. Any device used to implement the speech recognition method introduced in the first aspect of the embodiment of the present invention falls within the scope that the embodiment of the present invention intends to protect.

In a fourth aspect, based on the same inventive concept, an embodiment of the present invention provides a speech recognition model establishing device. Referring to Fig. 6, the device includes:

a third determining module 60, configured to determine at least one sample user for a preset user group, the preset user group being the group to which users whose speech clarity is lower than a predetermined clarity belong;

a second collecting module 61, configured to collect speech samples of the at least one sample user, wherein each speech sample is labelled with its semantics;

a third obtaining module 62, configured to determine the speech feature corresponding to each semantics based on the speech samples included under that semantics, and to associate the corresponding speech features with the semantics to obtain the speech recognition model.

Optionally, the device further includes:

an identification module, configured to recognize, after speech data produced by a user is obtained, the speech data through the speech recognition model to obtain the first semantics represented by the speech data, the user being a user whose speech clarity is lower than the predetermined clarity;

a second providing module, configured to provide the first semantics of the first speech feature to a preset user, so that the preset user can judge whether the first semantics of the first speech feature is accurate;

a second acquiring module, configured to acquire the second semantics that the preset user provides for the first speech feature when the preset user finds the first semantics inaccurate;

a second replacement module, configured to replace the first semantics of the first speech feature with the second semantics.

The speech recognition model establishing device introduced in the fourth aspect of the embodiment of the present invention is the device used to implement the speech recognition model establishing method of the second aspect. Based on the speech recognition model establishing method introduced in the second aspect, those skilled in the art can understand the specific structure and variations of the device, so details are not repeated here. Any device used to implement the speech recognition model establishing method introduced in the second aspect of the embodiment of the present invention falls within the scope that the embodiment of the present invention intends to protect.

In a fifth aspect, based on the same inventive concept, an embodiment of the present invention provides an electronic device including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

extracting the first speech feature of the speech data;

The electronic device introduced in the fifth aspect of the embodiment of the present invention is the electronic device used to implement the speech recognition method of the first aspect. Based on the speech recognition method introduced in the first aspect, those skilled in the art can understand the specific structure and variations of the electronic device, so details are not repeated here. Any electronic device used to implement the speech recognition method introduced in the first aspect of the embodiment of the present invention falls within the scope that the embodiment of the present invention intends to protect.

In a sixth aspect, based on the same inventive concept, an embodiment of the present invention provides an electronic device including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

The electronic device introduced in the sixth aspect of the embodiment of the present invention is the electronic device used to implement the speech recognition model establishing method of the second aspect. Based on the speech recognition model establishing method introduced in the second aspect, those skilled in the art can understand the specific structure and variations of the electronic device, so details are not repeated here. Any electronic device used to implement the speech recognition model establishing method introduced in the second aspect of the embodiment of the present invention falls within the scope that the embodiment of the present invention intends to protect.

Fig. 7 is a block diagram of an electronic device 800 for a speech recognition method (or a speech recognition model establishing method) according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, a wristband, a children's watch, etc.

Referring to Fig. 7, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The power component 806 supplies power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a loudspeaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, etc. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, for example the display and keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.

A non-transitory computer-readable storage medium, wherein, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a speech recognition method, the method including:

extracting the first speech feature of the speech data;

A non-transitory computer-readable storage medium, wherein, when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a speech recognition model establishing method, the method including:

Fig. 8 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

A non-transitory computer-readable storage medium, wherein, when the instructions in the storage medium are executed by the central processing unit 1922 of the server 1900, the server is enabled to perform a speech recognition method, the method including:

extracting the first speech feature of the speech data;

A non-transitory computer-readable storage medium, wherein, when the instructions in the storage medium are executed by the central processing unit 1922 of the server 1900, the server is enabled to perform a speech recognition model establishing method, the method including:

One or more embodiments of the present invention have at least the following beneficial effects:

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although preferred embodiments of the present invention have been described, those skilled in the art can make other changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these changes and modifications.

Claims

1. A speech recognition method, characterized by comprising:

obtaining speech data produced by a user, the user being a user whose speech clarity is lower than a predetermined clarity;

extracting a first speech feature of the speech data;

determining, based on a speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech clarity is lower than the predetermined clarity belong.

2. The method according to claim 1, characterized in that the user whose speech clarity is lower than the predetermined clarity includes at least one of: a user who enunciates words indistinctly, a user whose speech leaks air, a user who lisps, and a user who cannot speak words and can only make sounds.

3. The method according to claim 1, characterized in that the speech recognition model of the preset user group is established in the following manner:

collecting speech samples of at least one sample user, wherein each speech sample is labelled with its semantics;

determining the speech feature corresponding to each semantics based on the speech samples included under that semantics, and associating the corresponding speech features with the semantics to obtain the speech recognition model.

4. The method according to claim 3, characterized in that the preset user group includes M user groups, M being a positive integer, and the speech recognition model of the preset user group contains recognition models of the M user groups, the recognition model of each user group containing the correspondence between speech features and semantics for that user group; or

the preset user group includes M user groups, and the speech recognition model of the preset user group contains the correspondence between semantics and the speech features of each of the M user groups.

5. The method according to claim 3, characterized in that, after the first semantics represented by the first speech feature is determined based on the speech recognition model of the preset user group, the method further comprises:

providing the first semantics of the first speech feature to a preset user, so that the preset user can judge whether the first semantics of the first speech feature is accurate;

obtaining the second semantics that the preset user provides for the first speech feature when the preset user finds the first semantics inaccurate;

replacing, in the speech recognition model, the first semantics of the first speech feature with the second semantics.

6. The method according to any one of claims 1-5, characterized in that determining, based on the speech recognition model of the preset user group, the first semantics represented by the first speech feature comprises:

dividing the speech feature into at least one speech feature fragment;

matching each of the at least one speech feature fragment against the speech features in the speech recognition model, and thereby identifying the semantics of each speech feature fragment;

combining the semantics of each of the at least one speech feature fragment to obtain the first semantics represented by the speech feature.

7. The method according to any one of claims 1-5, characterized in that the speech feature includes a frequency range and/or a frequency shape.

8. The method according to any one of claims 1-5, characterized in that, after the first semantics represented by the first speech feature is identified based on the speech recognition model of the preset user group, the method further comprises:

judging whether the first semantics contains a predetermined keyword, and, when the predetermined keyword is contained, sending the first semantics to the electronic device of the preset user.

9. A speech recognition model establishing method, characterized by comprising:

for a preset user group, determining at least one sample user, the preset user group being the group to which users whose speech clarity is lower than a predetermined clarity belong;

determining the speech feature corresponding to each semantics based on the speech samples included under that semantics, and associating the corresponding speech features with the semantics to obtain the speech recognition model.

10. The method according to claim 9, characterized in that the preset user group includes M user groups, M being a positive integer, and the speech recognition model of the preset user group contains recognition models of the M user groups, the recognition model of each user group containing the correspondence between speech features and semantics for that user group; or

11. The method according to claim 9, characterized in that, after associating the corresponding speech features with the semantics to obtain the speech recognition model, the method further comprises:

after speech data produced by a user is obtained, recognizing the speech data through the speech recognition model to obtain the first semantics represented by the speech data, the user being a user whose speech clarity is lower than the predetermined clarity;

12. The method according to any one of claims 9-11, characterized in that the speech feature includes a frequency range and/or a frequency shape.

13. A speech recognition device, characterized by comprising:

a first obtaining module, configured to obtain speech data produced by a user, the user being a user whose speech clarity is lower than a predetermined clarity;

an extraction module, configured to extract a first speech feature of the speech data;

a first determining module, configured to determine, based on a speech recognition model of a preset user group, the first semantics represented by the first speech feature, the preset user group being the group to which users whose speech clarity is lower than the predetermined clarity belong.

14. A speech recognition model establishing device, characterized by comprising:

a third determining module, configured to determine at least one sample user for a preset user group, the preset user group being the group to which users whose speech clarity is lower than a predetermined clarity belong;

a second collecting module, configured to collect speech samples of the at least one sample user, wherein each speech sample is labelled with its semantics;

15. An electronic device, characterized by including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:

extracting the first speech feature of the speech data;

16. An electronic device, characterized by including a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations: