CN105895103A - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number
CN105895103A
CN105895103A (application CN201510883295.6A; granted publication CN105895103B)
Authority
CN
China
Prior art keywords
user
information
participle
user profile
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510883295.6A
Other languages
Chinese (zh)
Other versions
CN105895103B (en)
Inventor
田伟森
赵恒艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leshi Zhixin Electronic Technology Tianjin Co Ltd
Original Assignee
Leshi Zhixin Electronic Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leshi Zhixin Electronic Technology Tianjin Co Ltd filed Critical Leshi Zhixin Electronic Technology Tianjin Co Ltd
Priority to CN201510883295.6A priority Critical patent/CN105895103B/en
Publication of CN105895103A publication Critical patent/CN105895103A/en
Application granted granted Critical
Publication of CN105895103B publication Critical patent/CN105895103B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention provides a speech recognition method and device. Speech information sent by a terminal is received, and acoustic feature information of the speech information is acquired. The acoustic feature information is input sequentially into an acoustic model and a language model, which recognize the speech information to obtain initial text information. The initial text information is then corrected on the basis of pre-stored user information to generate final text information. Because recognition errors in the initial text information are corrected before the corrected final text information is sent to the terminal, the terminal can provide accurate services to the user based on accurate final text information.

Description

Speech recognition method and device
Technical field
Embodiments of the present invention relate to the technical field of speech signal processing, and in particular to a speech recognition method and device.
Background technology
Speech recognition technology enables a machine to convert a speech signal into a corresponding command or text by recognizing and understanding it. At present, speech recognition technology is widely used in voice-interaction products such as speech control and speech translation.
Many terminals now provide a speech input function, and the application software installed on a terminal performs operations based on the speech recognition result in order to generate the information a user requires and present it to the user. Only when the terminal's speech recognition is good enough to identify the user's spoken input accurately can the service provided to the user be accurate. For example, a terminal may contain map application software through which the user can obtain a route from the current location to a desired destination. Suppose the user wants to go to "xx restaurant, Beijing": the terminal receives the user's speech input, recognizes it, and obtains the text "xx restaurant, Beijing"; the map application searches the map for that text and plans a route from the user's current position to "xx restaurant, Beijing". However, when Beijing contains at least two restaurants whose names are pronounced like "xx restaurant", the map application will either present multiple recognition results, or by default present the "xx restaurant" nearest to the user's current location. The user then has to screen the presented results manually, and the map application plans the route according to the result of this manual screening; otherwise the terminal presents a wrong route.
It can be seen that current speech recognition results suffer from a high error rate.
Summary of the invention
Embodiments of the present invention provide a speech recognition method and device in order to solve the problem that current speech recognition results have a high error rate.
The specific technical solutions provided by the embodiments of the present invention are as follows.
An embodiment of the present invention provides a speech recognition method, including:
receiving a voice data packet sent by a terminal, wherein the voice data packet contains speech information;
acquiring acoustic feature information of the speech information, wherein the acoustic feature information is information that characterizes the sound properties of the speech information;
inputting the acoustic feature information sequentially into a preset acoustic model and a preset language model, and obtaining the initial text information produced by recognizing the speech information;
correcting the initial text information according to pre-stored user information to generate final text information;
sending the final text information to the terminal.
An embodiment of the present invention provides a speech recognition device, including:
a receiving unit, configured to receive a voice data packet sent by a terminal, wherein the voice data packet contains speech information;
an acoustic feature information acquisition unit, configured to acquire acoustic feature information of the speech information, wherein the acoustic feature information is information that characterizes the sound properties of the speech information;
an initial text information acquisition unit, configured to input the acoustic feature information sequentially into a preset acoustic model and a preset language model and obtain the initial text information produced by recognizing the speech information;
a final text information generating unit, configured to correct the initial text information according to pre-stored user information and generate final text information;
a sending unit, configured to send the final text information to the terminal.
In the embodiments of the present invention, speech information sent by a terminal is received and its acoustic feature information is acquired; the acoustic feature information is input sequentially into an acoustic model and a language model, and the initial text information obtained by these models recognizing the speech information is acquired; the initial text information is then corrected according to pre-stored user information to generate final text information. With this technical solution, the initial text information obtained by recognition is corrected so that errors in it are repaired, and the corrected final text information is sent to the terminal, so that the terminal can provide the user with accurate services based on accurate final text information.
Accompanying drawing explanation
Fig. 1 is a schematic architecture diagram of the speech recognition system in an embodiment of the present invention;
Fig. 2 is a flowchart of speech recognition in Embodiment 1 of the present invention;
Fig. 3 is a flowchart of database establishment in Embodiment 2 of the present invention;
Fig. 4 is a schematic structural diagram of the speech recognition device in Embodiment 3 of the present invention.
Detailed description of the invention
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, the speech recognition system in an embodiment of the present invention comprises a terminal and a server. The terminal is a device with a communication function and a human-computer interaction interface, such as a personal computer, a tablet computer or a mobile phone; it may run various operating systems, such as a Microsoft operating system, the Android operating system or the iOS operating system, on which various compatible application software can be installed, such as map applications and chat tools. The server is equipped with a speech recognition component and a speech recognition correction component: the speech recognition component recognizes the speech information sent by the terminal, and the speech recognition correction component corrects the recognition result of the speech recognition component. Further, the server also includes a voiceprint service component, a TTS (Text To Speech) component, a data service component and a user database. The voiceprint service component analyzes the speech information sent by the terminal to obtain initial user information; the TTS component converts final text information into speech information; the data service component analyzes the initial user information obtained by the voiceprint service component to obtain final user information; and the database stores the user information obtained by the data service component's analysis together with the terminal identifier corresponding to that user information.
Embodiment one
Referring to Fig. 2, in an embodiment of the present invention the server performs speech recognition as follows.
Step 200: receive a voice data packet sent by the terminal, wherein the voice data packet contains speech information.
In this embodiment, the terminal uses its voice acquisition component and calls an SDK (Software Development Kit) to obtain the speech information input by the user; the terminal then generates a voice data packet from the speech information and sends the voice data packet to the server.
Optionally, a wireless communication network connects the terminal and the server, and the terminal sends the voice data packet containing the speech information to the server over this wireless communication network.
Further, after the server receives the voice data packet sent by the terminal, it performs noise removal on the collected speech information to reject interference factors in it, such as background music or background noise present while the user was speaking, thereby ensuring the accuracy of the final text information.
Step 210: acquire acoustic feature information of the speech information, wherein the acoustic feature information is information that characterizes the sound properties of the speech information.
In this embodiment, the speech recognition component in the server parses the speech information and obtains the acoustic feature information it contains. The acoustic feature information is a sequence of spectral information: the pronunciation of each character or word corresponds acoustically to a segment of spectrum, and differently pronounced characters have different spectra, so this spectral information can characterize the sound properties of the speech information.
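As a concrete illustration of step 210, the sketch below frames a waveform and takes the magnitude spectrum of each frame, yielding the kind of spectral sequence the paragraph above describes. This is a simplified assumption rather than the patent's actual front end; the frame size, hop and windowing choices are invented for the example.

```python
import cmath
import math

def frame_spectrum(frame):
    """Magnitude spectrum (non-negative frequencies) of one Hamming-
    windowed frame, via a naive O(N^2) DFT for self-containment."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def acoustic_features(signal, frame_len=64, hop=32):
    """One spectral vector per overlapping frame: the 'sequence of
    spectral information' the embodiment describes."""
    return [frame_spectrum(signal[s:s + frame_len])
            for s in range(0, len(signal) - frame_len + 1, hop)]

# A 1 kHz tone sampled at 8 kHz: with 64-sample frames each bin spans
# 125 Hz, so every frame's spectrum should peak in bin 8.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(512)]
feats = acoustic_features(tone)
```

In practice the same spectra would be converted into cepstral features, but the spectral sequence alone already illustrates why different pronunciations produce different feature vectors.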
Step 220: input the acoustic feature information sequentially into the preset acoustic model and language model, and obtain the initial text information produced by recognizing the speech information.
In this embodiment, the speech recognition component in the server inputs the acoustic feature information sequentially into the preset acoustic model and language model, and obtains the initial text information produced by the language model.
Optionally, the speech recognition component in the server inputs the acoustic feature information into the preset acoustic model and obtains the pronunciation template identifier output by the acoustic model; it then inputs the pronunciation template identifier into the language model and obtains the initial text information output by the language model. The acoustic model and the language model are obtained by training on a large number of training samples according to the dynamic time warping principle, the hidden Markov principle, or the vector quantization principle.
Specifically, the acoustic model matches the acoustic feature information against each pronunciation template it contains and obtains the distance between the acoustic feature information and each pronunciation template; the pronunciation templates may be word pronunciation models, semi-syllable models or phoneme models. From all its pronunciation templates, the acoustic model selects, for each pronunciation contained in the acoustic feature information, the template with the minimum distance. Because a mapping exists between the pronunciation templates in the acoustic model and the text in the language model, inputting the identifier of a pronunciation template into the language model lets the language model obtain the text corresponding to that identifier.
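The minimum-distance template matching described above can be sketched with dynamic time warping, one of the training principles this embodiment names. The template identifiers and feature values below are invented, and scalars stand in for the spectral vectors a real acoustic model would compare.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences,
    the classic alignment measure for template-based recognition."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            # Allow stretch/compress in time via the three predecessors.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def nearest_template(features, templates):
    """Return the identifier of the pronunciation template whose DTW
    distance to the input features is smallest (step 220's matching)."""
    return min(templates, key=lambda t: dtw_distance(features, templates[t]))

# Invented templates keyed by pronunciation identifier.
templates = {
    "quan": [1.0, 3.0, 2.0, 1.0],
    "yue":  [5.0, 6.0, 5.0, 4.0],
}
best = nearest_template([1.1, 2.9, 2.1, 1.0], templates)
```

The selected identifier, not the raw features, is what the acoustic model passes on to the language model in the next step.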
Optionally, the language model contains multiple trees, each with a character or a pronunciation as its root node and, as child nodes, the phrases that the character can form. Because each pronunciation may correspond to multiple texts, the language model performs the following operation for each pronunciation template identifier output by the acoustic model: it queries the trees corresponding to that identifier and, using the identifiers that follow it, obtains the text corresponding to this identifier and to the following identifiers. Proceeding in this way, it obtains all texts corresponding to the speech information and generates the initial text information from them. The language model may output a single item of initial text information or several.
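A minimal sketch of the lookup just described, under the simplifying assumption that each pronunciation identifier maps to a flat set of scored candidate texts rather than a full tree; it shows why the model may output one or several ranked initial texts. All identifiers, candidate texts and scores are invented.

```python
import math
from itertools import product

# Invented mapping: pronunciation identifier -> scored candidate texts
# (a flattened stand-in for the per-pronunciation trees in the text).
LM_CANDIDATES = {
    "quan-ju-de": {"全聚德": 0.9, "全句德": 0.1},
    "lu-kuang": {"路况": 0.8, "录矿": 0.2},
}

def ranked_texts(pron_ids):
    """Enumerate every candidate text with its joint score, best first;
    this is why the model may output one or several initial texts."""
    combos = product(*[LM_CANDIDATES[p].items() for p in pron_ids])
    scored = [("".join(word for word, _ in combo),
               math.prod(score for _, score in combo))
              for combo in combos]
    return sorted(scored, key=lambda pair: -pair[1])

best_text, best_score = ranked_texts(["quan-ju-de", "lu-kuang"])[0]
```

The full ranked list corresponds to the "multiple items of initial text information" case that the correction step later has to screen.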
With the above technical solution, because the acoustic model and the language model are obtained by scientific training on a large amount of speech information, inputting the speech information into the acoustic model and language model yields comparatively accurate initial text information.
Step 230: correct the initial text information according to the pre-stored user information and generate final text information.
In this embodiment, the speech recognition correction component in the server extracts the pre-stored user information from the user database and corrects the initial text information according to it. The user information is uploaded by the user through the terminal, and/or obtained by the server through recognition training on the speech information of a large number of users.
Optionally, the pre-stored user information is acquired as follows: the server obtains the terminal identifier contained in the voice data packet and looks up the user information corresponding to that identifier in the user information set. The user information includes the user's position at historical time points, the user's age, or the user's sex; the user information set contains the correspondence between terminal identifiers and user information.
Optionally, correcting the initial text information according to the pre-stored user information to generate final text information specifically includes: segmenting the initial text information into individual segments (participles). For a place segment among the segments, the server searches the user information for the historical time point matching the current time point and obtains the user's position at the historical time point found; if the obtained position fails to match the place segment wholly or in part, but the pronunciation similarity between the place segment and the obtained position reaches a preset threshold, the place segment is replaced with the obtained position. For a special segment, i.e. a segment that has homophones with different meanings, the server corrects it according to the user's age or sex contained in the user information.
Optionally, the current time point matches a historical time point when the time difference between them is within a preset range; this preset range is set according to the specific application scenario.
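Step 230's place-segment correction can be sketched as follows. This is a simplified assumption, not the patent's implementation: it matches the current clock time against historical time points within a preset window, and a substring check stands in for the pronunciation-similarity threshold. The profile data, terminal identifier and window size are all invented.

```python
from datetime import datetime

# Hypothetical profile store: terminal identifier -> location history
# plus demographics (all names and data invented for this sketch).
USER_PROFILES = {
    "term-001": {
        "history": [(datetime(2015, 12, 1, 18, 10), "和平门全聚德")],
        "age": 24,
        "sex": "female",
    },
}

def correct_place(segments, terminal_id, now, window_minutes=30):
    """Replace a place segment with a historically visited place when
    the current clock time matches a history point within the preset
    window and the recorded place partially matches the segment."""
    profile = USER_PROFILES.get(terminal_id)
    if profile is None:
        return list(segments)
    corrected = []
    for seg in segments:
        replacement = seg
        for when, place in profile["history"]:
            minutes_apart = abs((now.hour * 60 + now.minute)
                                - (when.hour * 60 + when.minute))
            if (minutes_apart <= window_minutes
                    and seg in place and seg != place):
                replacement = place  # e.g. "全聚德" -> "和平门全聚德"
                break
        corrected.append(replacement)
    return corrected

fixed = correct_place(["怎么去", "全聚德", "路况"], "term-001",
                      datetime(2015, 12, 5, 18, 0))
```

With the 18:10 history point and an 18:00 query, the ambiguous "全聚德" is expanded to the historically visited "和平门全聚德", mirroring the Quanjude example that follows.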
For example, when the initial text information is "how is the traffic to Quanjude" and Beijing contains several Quanjude restaurants, the server first obtains the place segment "Quanjude" contained in the initial text information; it finds that the current time is 18:00 and detects that the user was located at the Hepingmen Quanjude restaurant around 18:10 on three past occasions. The server therefore assumes the user is searching for "Hepingmen Quanjude" and corrects the initial text information to "how is the traffic to Hepingmen Quanjude".
As another example, when the initial text information is "how is the traffic", the server assumes by default that a place segment is implied in this initial text information; it finds that the current time is 18:00 and detects that the user is usually located at "xx residential area" around this time point, so it corrects the initial text information to "how is the traffic to xx residential area".
As a further example, when the initial text information is "how about Yuxi", and "Yuxi" has the homophone "Yue-Sai", the server obtains the user's age and sex; when the user's age is 20-26 and the user's sex is female, the server corrects the initial text information to "how about Yue-Sai".
Further, when there are multiple items of initial text information, the server can use the above approach to screen the most accurate item from among them and correct the chosen item.
Further, the server can also correct the initial text information according to the type of application software that sent the voice data packet. For example, when the speech input by the user is "how about Yue-Sai" but the application currently running on the terminal is map application software, then because "Yue-Sai" is not a place name, the server corrects the initial text information to "how about Yuxi".
Further, correcting the initial text information according to the pre-stored user information to generate final text information also includes: when no user information corresponding to the terminal identifier is stored locally, determining the age and sex of the user who provided the speech information from the acoustic feature information, and correcting the initial text information according to the determined age and sex to generate final text information.
Optionally, determining the age and sex of the user who provided the speech information from the acoustic feature information specifically includes: the voiceprint service component extracts the biological feature data in the acoustic feature information, the biological feature data comprising timbre, sound quality, tone, speaking rate and the like; the voiceprint service component then obtains the user's age and sex according to the biological feature data and the acoustic model.
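A toy stand-in for the voiceprint component's age/sex estimation, assuming two scalar biological features (fundamental frequency and speaking rate) and invented thresholds; the patent itself leaves the mapping from biological feature data to demographics unspecified.

```python
def estimate_speaker(pitch_hz, rate_words_per_sec):
    """Guess sex from fundamental frequency and an age band from
    speaking rate. Both thresholds are illustrative only and are not
    derived from the patent."""
    sex = "female" if pitch_hz >= 165.0 else "male"
    age_band = "20-26" if rate_words_per_sec >= 3.5 else "40+"
    return sex, age_band

# With these invented thresholds a 210 Hz, fast speaker is classed as
# a young female, matching the demographic in the "Yue-Sai" example.
guess = estimate_speaker(210.0, 4.0)
```

A production voiceprint service would use a trained classifier over many such features rather than hand-set thresholds.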
Step 240: send the final text information to the terminal.
In this embodiment, the server sends the final text information to the terminal over the wireless communication network.
Further, after generating the final text information, the server can convert it into speech information and send that speech information to the terminal, which plays the final text information.
Further, after generating the final text information, the server can obtain the service requested by the user according to the final text information, generate the data packet corresponding to the requested service, and send it to the terminal. The data packet can be in text form or in speech form.
With the above technical solution, the initial text information obtained by recognition is corrected according to the user's personalized information, so that errors in the initial text information are repaired and the accuracy of speech recognition is improved. Further, the final text information generated after correction is sent to the terminal, so that the terminal can provide the user with accurate services based on accurate final text information.
Embodiment two
Referring to Fig. 3, in an embodiment of the present invention, the user information contained in the server's database is generated as follows.
Step 300: receive a voice data packet sent by the terminal, wherein the voice data packet contains speech information.
Step 310: acquire the acoustic feature information contained in the speech information.
Step 320: determine, from the acoustic feature information, the age and sex of the user who provided the speech information, as well as the final text information.
Optionally, the server can also obtain environmental data, such as the time and the user's range of activity, from the acoustic feature information.
Step 330: analyze the determined age and sex of the user together with the final text information, and generate user information according to the analysis result.
Optionally, the server can also generate user information according to the environmental data.
Step 340: establish the correspondence between the terminal identifier and the generated user information, and store this correspondence in the user information set.
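The correspondence established in step 340, and looked up again during the correction of step 230, can be sketched as a small keyed store. The class name and profile fields are invented for illustration.

```python
class UserProfileStore:
    """Minimal stand-in for the user information set of Embodiment 2:
    it keeps the correspondence between a terminal identifier and the
    user information generated from analysed speech."""

    def __init__(self):
        self._by_terminal = {}

    def save(self, terminal_id, profile):
        # Step 340: establish and store the correspondence.
        self._by_terminal[terminal_id] = profile

    def lookup(self, terminal_id):
        # Step 230's lookup path: None means no local profile, in which
        # case the server falls back to estimating age/sex acoustically.
        return self._by_terminal.get(terminal_id)

store = UserProfileStore()
store.save("term-001", {"age": 24, "sex": "female"})
```

Keying by terminal identifier rather than by user is exactly the design the claims describe: the terminal identifier carried in the voice data packet is the only handle the server has.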
Embodiment three
Based on the above technical solutions and referring to Fig. 4, an embodiment of the present invention provides a speech recognition device, comprising a receiving unit 40, an acoustic feature information acquisition unit 41, an initial text information acquisition unit 42, a final text information generating unit 43 and a sending unit 44, wherein:
the receiving unit 40 is configured to receive a voice data packet sent by a terminal, wherein the voice data packet contains speech information;
the acoustic feature information acquisition unit 41 is configured to acquire acoustic feature information of the speech information, wherein the acoustic feature information is information that characterizes the sound properties of the speech information;
the initial text information acquisition unit 42 is configured to input the acoustic feature information sequentially into a preset acoustic model and a preset language model and obtain the initial text information produced by recognizing the speech information;
the final text information generating unit 43 is configured to correct the initial text information according to pre-stored user information and generate final text information;
the sending unit 44 is configured to send the final text information to the terminal.
Further, the voice data packet also contains a terminal identifier, and the device also includes a pre-stored information acquisition unit 45 configured to look up the user information corresponding to the terminal identifier in the user information set; the user information includes the user's position at historical time points, the user's age, or the user's sex, and the user information set contains the correspondence between terminal identifiers and user information.
Optionally, the initial text information acquisition unit 42 is specifically configured to: input the acoustic feature information into the preset acoustic model and obtain the pronunciation template identifier output by the acoustic model; then input the pronunciation template identifier into the language model and obtain the initial text information output by the language model.
Optionally, the final text information generating unit 43 is specifically configured to: segment the initial text information into individual segments; for a place segment among the segments, search the user information for the historical time point matching the current time point and obtain the user's position at the historical time point found, and if the obtained position fails to match the place segment wholly or in part but the pronunciation similarity between the place segment and the obtained position reaches a preset threshold, replace the place segment with the obtained position; and, for a special segment, i.e. a segment that has homophones with different meanings, correct it according to the user's age or sex contained in the user information.
Further, the final text information generating unit 43 is also configured to: when no user information corresponding to the terminal identifier is stored locally, determine the age and sex of the user who provided the speech information from the acoustic feature information, and correct the initial text information according to the determined age and sex to generate final text information.
Further, the device also includes a processing unit 46 configured to: after the final text information is generated, analyze the determined age and sex of the user together with the final text information, generate user information according to the analysis result, establish the correspondence between the terminal identifier and the generated user information, and store this correspondence in the user information set.
In summary, in the embodiments of the present invention, speech information sent by a terminal is received and its acoustic feature information is acquired; the acoustic feature information is input sequentially into the acoustic model and the language model, and the initial text information obtained by these models recognizing the speech information is acquired; the initial text information is then corrected according to pre-stored user information to generate final text information. With this technical solution, the initial text information obtained by recognition is corrected so that errors in it are repaired; the final text information generated after correction is sent to the terminal, so that the terminal can provide the user with accurate services based on accurate final text information.
The device embodiments described above are merely schematic. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the technical solution above, or the part of it that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the embodiments of the present invention. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some of the technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solution to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A speech recognition method, characterized in that it comprises:
receiving a voice data packet sent by a terminal, wherein the voice data packet contains voice information;
obtaining acoustic feature information of the voice information, wherein the acoustic feature information characterizes the sound properties of the voice information;
inputting the acoustic feature information, in sequence, into a preset acoustic model and a preset language model, to obtain original text information produced by recognizing the voice information;
correcting the original text information according to pre-stored user profile information, to generate final text information; and
sending the final text information to the terminal.
2. The method according to claim 1, characterized in that the voice data packet further contains an identifier of the terminal;
and that the pre-stored user profile information is obtained by:
searching a user profile set for the user profile corresponding to the identifier of the terminal, wherein the user profile includes the location of the user at a historical time point, the age of the user, or the sex of the user, and the user profile set contains correspondences between terminal identifiers and user profiles.
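As an illustration only (not part of the claims), the user profile set of claim 2 can be modeled as a mapping from terminal identifiers to profile records; the field names below are hypothetical:

```python
# The user profile set holds correspondences between terminal identifiers and
# user profiles (location at historical time points, age, sex of the user).
user_profile_set = {
    "terminal-001": {
        "locations": {"08:00": "Haidian", "19:00": "Chaoyang"},
        "age": 34,
        "sex": "female",
    },
}

def lookup_profile(terminal_id, profile_set):
    # Returns the profile for this terminal, or None when no local profile
    # exists (the fallback case handled by claim 5).
    return profile_set.get(terminal_id)
```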
3. The method according to claim 2, characterized in that inputting the acoustic feature information, in sequence, into the preset acoustic model and language model, to obtain the original text information produced by recognizing the voice information, specifically comprises:
inputting the acoustic feature information into the preset acoustic model, to obtain the pronunciation template identifier output by the acoustic model; and
inputting the pronunciation template identifier into the language model, to obtain the original text information output by the language model.
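The two-stage decoding of claim 3 can be illustrated with stand-in lookup tables; real acoustic and language models are statistical, and the pinyin-style identifiers here are invented for the example:

```python
# Stand-in acoustic model: acoustic feature frames -> pronunciation template ids.
ACOUSTIC_TABLE = {(0.1, 0.2): "ni3", (0.3, 0.4): "hao3"}
# Stand-in language model: pronunciation template ids -> text.
LANGUAGE_TABLE = {("ni3", "hao3"): "你好"}

def decode(feature_frames):
    # Stage 1: acoustic model output (pronunciation template identifiers).
    pron_ids = tuple(ACOUSTIC_TABLE[tuple(f)] for f in feature_frames)
    # Stage 2: language model output (the "original text information").
    return LANGUAGE_TABLE[pron_ids]
```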
4. The method according to claim 2 or 3, characterized in that correcting the original text information according to the pre-stored user profile information, to generate the final text information, specifically comprises:
segmenting the original text information to obtain word segments; for a location word segment among the word segments, searching the user profile for the historical time point matching the current time point and obtaining the location of the user at the historical time point found, and, if the obtained location of the user fails to match the location word segment in whole or in part while the pronunciation similarity between the location word segment and the obtained location of the user reaches a preset threshold, replacing the location word segment with the obtained location of the user; and, for a special word segment among the word segments, correcting the special word segment according to the user age or user sex contained in the user profile, wherein a special word segment is a word segment that has a homophone with a different meaning.
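A minimal sketch (illustrative, not part of the claims) of the location-word check in claim 4: replacement happens only when the recognized word differs from the user's stored location but sounds sufficiently like it. The similarity measure and the 0.6 threshold are invented for the example; the claim only requires that some pronunciation similarity reach a preset threshold:

```python
def pronunciation_similarity(a, b):
    # Toy similarity: fraction of matching leading characters of two
    # romanized (pinyin) pronunciations.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def correct_location_word(word, word_pinyin, user_location, location_pinyin,
                          threshold=0.6):
    # Replace the recognized location word with the user's stored location
    # when the text fails to match but the pronunciations are close enough.
    if word != user_location and pronunciation_similarity(
            word_pinyin, location_pinyin) >= threshold:
        return user_location
    return word
```

A misrecognized homophone such as 前们 ("qianmen") would be replaced by the stored location 前门 (same pronunciation), while a dissimilar-sounding word is left untouched.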
5. The method according to claim 4, characterized in that correcting the original text information according to the pre-stored user profile information, to generate the final text information, further comprises:
when no user profile corresponding to the identifier of the terminal is stored locally, determining, according to the acoustic feature information, the age and sex of the user providing the voice information; and
correcting the original text information according to the determined age and sex of the user providing the voice information, to generate the final text information.
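Claim 5's fallback estimates age and sex from the acoustic features when no profile is stored. The pitch thresholds below are rough illustrative heuristics, not values from the patent; a real implementation would use a classifier trained on the acoustic feature information:

```python
def estimate_speaker(mean_pitch_hz):
    # Crude, illustrative heuristics: a higher fundamental frequency tends to
    # correlate with female and child voices.
    sex = "female" if mean_pitch_hz > 165 else "male"
    age_group = "child" if mean_pitch_hz > 250 else "adult"
    return {"sex": sex, "age_group": age_group}
```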
6. The method according to claim 5, characterized in that, after the final text information is generated, the method further comprises:
analyzing the determined age and sex of the user together with the final text information, and generating a user profile according to the analysis result; and
establishing a correspondence between the identifier of the terminal and the generated user profile, and storing the correspondence in the user profile set.
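Claim 6 closes the loop: the determined age and sex, plus an analysis of the final text, yield a new profile, which is stored against the terminal identifier for future lookups under claim 2. A hedged sketch with hypothetical field names and a naive keyword scan standing in for the text analysis:

```python
def build_and_store_profile(terminal_id, age, sex, final_text, profile_set,
                            now="19:00"):
    # Analyze the final text for location mentions (here: a naive keyword
    # scan) and combine them with the determined age and sex into a profile.
    known_places = ["Chaoyang", "Haidian"]
    locations = {now: p for p in known_places if p in final_text}
    profile = {"age": age, "sex": sex, "locations": locations}
    # Establish the terminal-identifier -> profile correspondence and store it.
    profile_set[terminal_id] = profile
    return profile
```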
7. A speech recognition device, characterized in that it comprises:
a receiving unit, configured to receive a voice data packet sent by a terminal, wherein the voice data packet contains voice information;
an acoustic feature information obtaining unit, configured to obtain acoustic feature information of the voice information, wherein the acoustic feature information characterizes the sound properties of the voice information;
an original text information obtaining unit, configured to input the acoustic feature information, in sequence, into a preset acoustic model and a preset language model, to obtain original text information produced by recognizing the voice information;
a final text information generating unit, configured to correct the original text information according to pre-stored user profile information, to generate final text information; and
a sending unit, configured to send the final text information to the terminal.
8. The device according to claim 7, characterized in that the voice data packet further contains an identifier of the terminal;
and that the device further comprises a pre-stored information obtaining unit, configured to:
search a user profile set for the user profile corresponding to the identifier of the terminal, wherein the user profile includes the location of the user at a historical time point, the age of the user, or the sex of the user, and the user profile set contains correspondences between terminal identifiers and user profiles.
9. The device according to claim 8, characterized in that the original text information obtaining unit is specifically configured to:
input the acoustic feature information into the preset acoustic model, to obtain the pronunciation template identifier output by the acoustic model; and
input the pronunciation template identifier into the language model, to obtain the original text information output by the language model.
10. The device according to claim 8 or 9, characterized in that the final text information generating unit is specifically configured to:
segment the original text information to obtain word segments;
for a location word segment among the word segments, search the user profile for the historical time point matching the current time point and obtain the location of the user at the historical time point found, and, if the obtained location of the user fails to match the location word segment in whole or in part while the pronunciation similarity between the location word segment and the obtained location of the user reaches a preset threshold, replace the location word segment with the obtained location of the user; and
for a special word segment among the word segments, correct the special word segment according to the user age or user sex contained in the user profile, wherein a special word segment is a word segment that has a homophone with a different meaning.
11. The device according to claim 10, characterized in that the final text information generating unit is further configured to:
when no user profile corresponding to the identifier of the terminal is stored locally, determine, according to the acoustic feature information, the age and sex of the user providing the voice information; and
correct the original text information according to the determined age and sex of the user providing the voice information, to generate the final text information.
12. The device according to claim 11, characterized in that it further comprises a processing unit, configured to:
after the final text information is generated, analyze the determined age and sex of the user together with the final text information, and generate a user profile according to the analysis result; and
establish a correspondence between the identifier of the terminal and the generated user profile, and store the correspondence in the user profile set.
CN201510883295.6A 2015-12-03 2015-12-03 Voice recognition method and device Active CN105895103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510883295.6A CN105895103B (en) 2015-12-03 2015-12-03 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510883295.6A CN105895103B (en) 2015-12-03 2015-12-03 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN105895103A true CN105895103A (en) 2016-08-24
CN105895103B CN105895103B (en) 2020-01-17

Family

ID=57002113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510883295.6A Active CN105895103B (en) 2015-12-03 2015-12-03 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN105895103B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758248A (en) * 2004-10-05 2006-04-12 微软公司 Systems, methods, and interfaces for providing personalized search and information access
US20120215537A1 (en) * 2011-02-17 2012-08-23 Yoshihiro Igarashi Sound Recognition Operation Apparatus and Sound Recognition Operation Method
KR20120101855A (en) * 2011-03-07 2012-09-17 (주)에이치씨아이랩 Result corrector for dictation speech recognition and result correction method
CN102682763A (en) * 2011-03-10 2012-09-19 北京三星通信技术研究有限公司 Method, device and terminal for correcting named entity vocabularies in voice input text
CN104508739A (en) * 2012-06-21 2015-04-08 谷歌公司 Dynamic language model
CN105095176A (en) * 2014-04-29 2015-11-25 华为技术有限公司 Method for extracting feature information of text information by user equipment and user equipment


Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682642A (en) * 2017-01-06 2017-05-17 竹间智能科技(上海)有限公司 Multi-language-oriented behavior identification method and multi-language-oriented behavior identification system
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium
US10964317B2 (en) 2017-07-05 2021-03-30 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wakeup method, apparatus and system, cloud server and readable medium
CN107731229A (en) * 2017-09-29 2018-02-23 百度在线网络技术(北京)有限公司 Method and apparatus for identifying voice
US11011163B2 (en) 2017-09-29 2021-05-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing voice
CN107945806A (en) * 2017-11-10 2018-04-20 北京小米移动软件有限公司 User identification method and device based on sound characteristic
CN108122555A (en) * 2017-12-18 2018-06-05 北京百度网讯科技有限公司 The means of communication, speech recognition apparatus and terminal device
CN108597495A (en) * 2018-03-15 2018-09-28 维沃移动通信有限公司 A kind of method and device of processing voice data
CN108549628A (en) * 2018-03-16 2018-09-18 北京云知声信息技术有限公司 The punctuate device and method of streaming natural language information
CN108682421A (en) * 2018-04-09 2018-10-19 平安科技(深圳)有限公司 A kind of audio recognition method, terminal device and computer readable storage medium
US11574632B2 (en) 2018-04-23 2023-02-07 Baidu Online Network Technology (Beijing) Co., Ltd. In-cloud wake-up method and system, terminal and computer-readable storage medium
CN110689881A (en) * 2018-06-20 2020-01-14 深圳市北科瑞声科技股份有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN108831479A (en) * 2018-06-27 2018-11-16 努比亚技术有限公司 A kind of audio recognition method, terminal and computer readable storage medium
CN110797014A (en) * 2018-07-17 2020-02-14 中兴通讯股份有限公司 Voice recognition method and device and computer storage medium
WO2020024582A1 (en) * 2018-07-28 2020-02-06 华为技术有限公司 Speech synthesis method and related device
CN109117484A (en) * 2018-08-13 2019-01-01 北京帝派智能科技有限公司 A kind of voice translation method and speech translation apparatus
CN109117484B (en) * 2018-08-13 2019-08-06 北京帝派智能科技有限公司 A kind of voice translation method and speech translation apparatus
US11335348B2 (en) 2018-10-24 2022-05-17 Beijing Xiaomi Mobile Software Co., Ltd. Input method, device, apparatus, and storage medium
CN109388699A (en) * 2018-10-24 2019-02-26 北京小米移动软件有限公司 Input method, device, equipment and storage medium
CN111402870B (en) * 2019-01-02 2023-08-15 中国移动通信有限公司研究院 Voice recognition method, device and equipment
CN111402870A (en) * 2019-01-02 2020-07-10 中国移动通信有限公司研究院 Voice recognition method, device and equipment
CN110047467B (en) * 2019-05-08 2021-09-03 广州小鹏汽车科技有限公司 Voice recognition method, device, storage medium and control terminal
CN110047467A (en) * 2019-05-08 2019-07-23 广州小鹏汽车科技有限公司 Audio recognition method, device, storage medium and controlling terminal
CN110246502A (en) * 2019-06-26 2019-09-17 广东小天才科技有限公司 Voice de-noising method, device and terminal device
CN110534112A (en) * 2019-08-23 2019-12-03 王晓佳 Distributed speech recongnition error correction device and method based on position and time
CN110534098A (en) * 2019-10-09 2019-12-03 国家电网有限公司客户服务中心 A kind of the speech recognition Enhancement Method and device of age enhancing
CN111475619A (en) * 2020-03-31 2020-07-31 北京三快在线科技有限公司 Text information correction method and device, electronic equipment and storage medium
CN113766171A (en) * 2021-09-22 2021-12-07 广东电网有限责任公司 Power transformation and defect elimination remote video consultation system and method based on AI voice control

Also Published As

Publication number Publication date
CN105895103B (en) 2020-01-17

Similar Documents

Publication Publication Date Title
CN105895103A (en) Speech recognition method and device
CN108447486B (en) Voice translation method and device
CN102243871B (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
CN105374356B (en) Audio recognition method, speech assessment method, speech recognition system and speech assessment system
CN109410664B (en) Pronunciation correction method and electronic equipment
CN104185868B (en) Authentication voice and speech recognition system and method
CN103714048B (en) Method and system for correcting text
CN105512228A (en) Bidirectional question-answer data processing method and system based on intelligent robot
CN101567189A (en) Device, method and system for correcting voice recognition result
CN104008752A (en) Speech recognition device and method, and semiconductor integrated circuit device
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN110120221A (en) The offline audio recognition method of user individual and its system for vehicle system
CN110021293A (en) Audio recognition method and device, readable storage medium storing program for executing
CN107240394A (en) A kind of dynamic self-adapting speech analysis techniques for man-machine SET method and system
CN110019741A (en) Request-answer system answer matching process, device, equipment and readable storage medium storing program for executing
CN113782026A (en) Information processing method, device, medium and equipment
CN108364655A (en) Method of speech processing, medium, device and computing device
Stemmer et al. Acoustic modeling of foreign words in a German speech recognition system
US11615787B2 (en) Dialogue system and method of controlling the same
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
CN111161718A (en) Voice recognition method, device, equipment, storage medium and air conditioner
CN110570838A (en) Voice stream processing method and device
CN111128127A (en) Voice recognition processing method and device
KR20160138613A (en) Method for auto interpreting using emoticon and apparatus using the same
CN114783424A (en) Text corpus screening method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Room 301-1, Room 301-3, Area B2, Animation Building, No. 126 Animation Road, Zhongxin Eco-city, Tianjin Binhai New Area, Tianjin

Applicant after: LE SHI ZHI XIN ELECTRONIC TECHNOLOGY (TIANJIN) Ltd.

Address before: 300453 Tianjin Binhai New Area, Tianjin Eco-city, No. 126 Animation and Animation Center Road, Area B1, Second Floor 201-427

Applicant before: Xinle Visual Intelligent Electronic Technology (Tianjin) Co.,Ltd.

Address after: 300453 Tianjin Binhai New Area, Tianjin Eco-city, No. 126 Animation and Animation Center Road, Area B1, Second Floor 201-427

Applicant after: Xinle Visual Intelligent Electronic Technology (Tianjin) Co.,Ltd.

Address before: 300467 Tianjin Binhai New Area, ecological city, animation Middle Road, building, No. two, B1 District, 201-427

Applicant before: LE SHI ZHI XIN ELECTRONIC TECHNOLOGY (TIANJIN) Ltd.

GR01 Patent grant
GR01 Patent grant
PP01 Preservation of patent right
PP01 Preservation of patent right

Effective date of registration: 20210201

Granted publication date: 20200117

PD01 Discharge of preservation of patent
PD01 Discharge of preservation of patent

Date of cancellation: 20240201

Granted publication date: 20200117

PP01 Preservation of patent right
PP01 Preservation of patent right

Effective date of registration: 20240313

Granted publication date: 20200117