CN105931642A - Speech recognition method, apparatus and system - Google Patents
- Publication number
- CN105931642A CN105931642A CN201610375073.8A CN201610375073A CN105931642A CN 105931642 A CN105931642 A CN 105931642A CN 201610375073 A CN201610375073 A CN 201610375073A CN 105931642 A CN105931642 A CN 105931642A
- Authority
- CN
- China
- Prior art keywords
- user
- speech recognition
- voice
- speech
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
Abstract
The invention provides a speech recognition method, a speech recognition apparatus and a speech recognition system. The method includes the following steps: a speech input of a user is obtained; a speech database is selected to recognize the speech input by the user, and recognition outputs are produced as results; domain judgment is used to select one or more candidate optimal recognition outputs from the recognition outputs; and the optimal recognition output among the one or more candidate optimal recognition outputs is determined using the personal identification information of the user as a decision condition. With the speech recognition method of the above technical schemes, the accuracy of speech recognition can be improved without increasing response time.
Description
Technical field
The present invention relates to the field of speech recognition, and in particular to a speech recognition method, apparatus and system.
Background technology
With the spread of smart devices, speech recognition systems have become a new means of applying information; at the same time, intelligent control of devices can be achieved through speech recognition systems.
In the use of speech recognition systems, user experience has become the focus of many systems. For speech recognition applications, response time and judgment accuracy are the core of improving user experience. Current judgment schemes mostly use a single specific data model to judge voice data; that is, a generic system is used to perform the judgment for all voice environments. Such a scheme inevitably increases the workload of speech recognition and lengthens the response and judgment time, thereby degrading the user experience.
In this field, a common automatic speech recognition (ASR) system recognizes speech input by means of a recognition engine. The engine model of a speech recognition system generally consists of two parts, an acoustic model and a language model, which correspond respectively to computing the probability from speech to syllables and from syllables to words. Language models are broadly divided into rule-based models and statistical models; the latter use probability and statistics to reveal the statistical regularities within linguistic units. The engine judges within a knowledge domain to complete the recognition output for the speech input.
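The two-part engine described above amounts to score combination: each candidate transcript receives an acoustic log-probability (speech to syllables) and a language-model log-probability (syllables to words), and candidates are ranked by their weighted sum. A minimal sketch with made-up scores; the `lm_weight` parameter and the candidate phrases are illustrative assumptions, not taken from the patent:

```python
def rank_hypotheses(candidates, lm_weight=0.8):
    """Rank candidate transcripts best-first by combined score.

    `candidates` maps each transcript to a pair
    (acoustic_log_prob, language_model_log_prob).
    """
    return sorted(
        candidates,
        key=lambda t: candidates[t][0] + lm_weight * candidates[t][1],
        reverse=True,
    )

# Made-up scores: acoustically the second guess is slightly better,
# but the language model strongly prefers the first.
scores = {
    "I want to listen to Shaoxing opera": (-12.5, -4.0),
    "I want to listen to Guangdong opera": (-12.0, -6.0),
}
```

With the language model included, the first transcript wins (-12.5 + 0.8 × (-4.0) = -15.7 beats -12.0 + 0.8 × (-6.0) = -16.8); with `lm_weight=0` the acoustic score alone would pick the second, which is why both parts of the engine matter.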
There are various ways to add specific user identifiers to a generic system so that speech judgment is performed over a narrower range, improving response time and judgment accuracy. A common approach in this field is to classify the databases set up for different dialects and accent forms, so that the speech input can be classified by the system at the initial judgment stage, achieving faster response times. The selection among these databases can incorporate a specific identification mark. This identification mark can come from the user side: it can be obtained by processing the user's speech input, or by other means such as the user's location information or the signal source of the user's mobile device. This information is fed into the ASR system as the user's identification information, assisting the selection of the user's data, improving response time, and reducing the error rate.
However, although the above schemes add the user's identification information, that information is used only as language-type or location input to help the system select the language database. While this reduces response time, it cannot exploit the identification information to produce a purposive output for the particular user in the final recognition result; that is, the recognition efficiency is not high.
A recognition method is therefore needed that can improve the recognition efficiency for the user while keeping the response time guaranteed.
Summary of the invention
To solve the above problems, embodiments of the present invention provide a speech recognition method, apparatus and system that improve the accuracy of speech recognition without increasing response time.
According to one aspect of the present invention, a speech recognition method is provided, including: obtaining a speech input of a user; selecting a speech database to recognize the speech input by the user, and outputting recognition outputs as results; using domain judgment to select one or more candidate optimal recognition outputs from the recognition outputs; and determining the optimal recognition output among the one or more candidate optimal recognition outputs using the personal identification information of the user as a decision condition.
According to another aspect of the present invention, a speech recognition apparatus is provided, including: a speech obtaining unit for obtaining the speech input of a user; a speech recognition unit for selecting a speech database to recognize the speech input by the user and outputting the recognition outputs as results; a first judgment unit for using domain judgment to select one or more candidate optimal recognition outputs from the recognition outputs; and a second judgment unit for determining the optimal recognition output among the one or more candidate optimal recognition outputs using the personal identification information of the user as a decision condition.
According to a third aspect of the present invention, a speech recognition system is provided, including: the above speech recognition apparatus; and a client device communicatively connected with the speech recognition apparatus.
The above schemes use the user's specific identification mark to perform a second-level judgment of the speech recognition results, and output that judgment result as the final result, achieving a multi-level recognition output. The newly added judgment stage takes the output of the domain judgment as its input, so only a small number of results need to be retained for the final judgment. The schemes therefore do not increase the load on the system, and can judge the output of speech recognition more accurately without reducing response time.
Brief description of the drawings
The above features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method of performing speech recognition using the user's native-place information according to an embodiment of the present invention;
Fig. 3 is a flowchart of another speech recognition method according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a speech recognition apparatus for implementing the speech recognition method according to an embodiment of the present invention; and
Fig. 5 is a schematic block diagram of a speech recognition system according to an embodiment of the present invention.
Detailed description of the invention
Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to denote identical or similar components, even when they appear in different figures. For clarity and conciseness, detailed descriptions of well-known functions and structures are omitted here to avoid obscuring the subject matter of the present invention.
Fig. 1 shows a schematic flowchart of a speech recognition method according to an embodiment of the present invention.
As shown in Fig. 1, in step S01, the speech input of the user is obtained.
In some examples, a client device currently being used by the user (for example, a voice receiving unit of the client device, such as a microphone) obtains the user's speech input. A speech recognition apparatus communicatively connected with the client device can then obtain the speech input from the client device. The client device used by the user may be the user's mobile phone, fixed terminal, PDA (personal digital assistant), notebook computer, netbook, tablet computer, and so on, but the present invention is not limited to these: any mobile or non-mobile device that those skilled in the art can conceive of may be used as the client device.
The speech recognition apparatus described herein may in some implementations be referred to as a server, a cloud server, a remote terminal, and so on, but the present invention is likewise not so limited: the speech recognition apparatus in the present invention may be any device capable of realizing the technical scheme of the invention, regardless of whether it is mobile or non-mobile, and regardless of what it is called in a given implementation.
In some examples, the user's voice information can be read by a unit such as the microphone of the client device, converted into an electronic signal, and stored. For example, the user may perform speech input through the microphone system of an electronic device: "play musical drama", "play opera", "I want to listen to Shaoxing opera", and so on.
Moreover, in some examples, such as when the speech recognition apparatus is local to the user, the client device may be omitted, and the user may input speech directly at the speech recognition apparatus (for example, at its microphone).
In step S02, a speech database is selected to recognize the speech input by the user, and the recognition outputs are produced as results.
In some examples, the speech database to be used may be selected, and according to the selected speech database, the acoustic model and language model of the speech recognition engine are used to recognize the speech and output the recognition results.
In step S03, domain judgment is used to select one or more candidate optimal recognition outputs from the recognition outputs.
Domain judgment can select the most preferred candidates among the output results. The output may contain multiple candidate results; for example, the candidates may be "I want to listen to Shaoxing opera", "I want to listen to Guangdong opera", and so on. Of course, in some cases only one output result may be produced.
Optionally, in step S04, the personal identification information of the user is detected.
This step can be performed between step S03 and step S05, which is elaborated next, but the invention is not limited to this; this step may be performed at any time before step S05. For example, when the user has used the speech recognition apparatus multiple times, the personal identification information detected during the user's previous uses of the apparatus may be stored, and the stored personal identification information may be used in the current recognition.
The personal identification information may include, for example, the user's geographical location information, the current connection signal source of the mobile device used by the user, the user's native place, and any other information known to those skilled in the art that can identify the user in a personalized way. The user's geographical location information can be obtained in several ways, used individually or in combination. For example, it can be obtained from the IP address of the user's network connection: when the user uses an intelligent voice device connected to a cloud server, detection of the user's network information may yield the user's address as "Shaoxing, Zhejiang Province". It can also be determined from the base station associated with the user's mobile device, or obtained by locating the user through the GPS system of the user's mobile device. One of these acquisition modes can be used alone, or any combination of them can be used to avoid misjudgment (for example, when an Internet user uses a proxy server, it is difficult to judge the user's position from the network information).
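The multi-source acquisition just described can be sketched as a priority-ordered fallback; the probe functions and their ordering are illustrative assumptions, not part of the patent:

```python
def resolve_location(sources):
    """Return the first location any source yields, trying sources in
    priority order. Each source is a zero-argument callable returning a
    location string, or None when it cannot determine one (e.g. GPS
    unavailable, or an IP hidden behind a proxy)."""
    for probe in sources:
        location = probe()
        if location is not None:
            return location
    return None

# Hypothetical probes, most reliable first.
probes = [
    lambda: None,                  # GPS fix unavailable indoors
    lambda: "Shaoxing, Zhejiang",  # base-station lookup succeeds
    lambda: None,                  # IP lookup unreliable behind a proxy
]
```

Ordering GPS and base-station lookup ahead of IP geolocation is one way to realize the "combination to avoid misjudgment" idea: an unreliable source simply returns None and the next one is tried.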
In step S05, the optimal recognition output among the one or more candidate optimal recognition outputs is determined using the personal identification information of the user as the decision condition.
Using the user's personal identification information as the decision condition, the multiple candidate optimal recognition outputs are judged; through a further small-range retrieval, the best among the candidate optimal recognition outputs is determined. For example, suppose the candidate optimal recognition outputs determined in step S03 are "I want to listen to Guangdong opera" and "I want to listen to Shaoxing opera", and the user's geographical location obtained in step S04 is "Shaoxing, Zhejiang Province". Then, by using this information as the decision condition, a low-sample-size retrieval is performed over the candidate optimal recognition outputs determined in step S03, and the output result can be determined to be "I want to listen to Shaoxing opera".
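The second-level judgment of step S05 can be sketched as a small re-ranking over the surviving candidates; the affinity table mapping locations to favored phrases is a made-up illustration, not data from the patent:

```python
def pick_best(candidates, user_location, affinity):
    """Among the domain-judgment candidates, prefer the one containing a
    keyword associated with the user's location; fall back to the first
    candidate when no association matches."""
    favored = affinity.get(user_location, ())
    for cand in candidates:
        if any(keyword in cand for keyword in favored):
            return cand
    return candidates[0]

# Hypothetical location-to-genre associations.
affinity = {
    "Shaoxing, Zhejiang": ("Shaoxing opera",),
    "Guangzhou, Guangdong": ("Guangdong opera",),
}
candidates = ["I want to listen to Guangdong opera",
              "I want to listen to Shaoxing opera"]
```

Because the search runs only over the few candidates that survived the domain judgment, this second pass stays cheap, which is the point the patent makes about response time.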
Thus, the accuracy of recognition is improved through the association between the user's personal identification information and the recognition outputs. In the above method, step S05 performs the second judgment only within a small recognition range, so this judgment does not impose an excessive load on the overall response time. The above scheme therefore ensures that, with essentially no increase in response time, the recognition rate of the user's speech input is improved, yielding a better user experience.
In another example, suppose the user inputs "I want to listen to Hongyan" through the intelligent voice system. The system's judgment may find that the higher-probability combinations are "swan goose" (Hongyan) and "red gorgeous" (Hongyan), two homophonous interpretations, and both can be output as candidate optimal combinations. In the final selection, the database adds the personalized identifier to perform the system judgment; according to the different personalized identifiers acquired, different results are ultimately output. This greatly improves the user experience and precisely identifies the user's request.
When the user's input itself clearly indicates, for example, a geographical location, the small-sample retrieval can be avoided after the multiple candidate optimal results are judged for output: the geographical information recognized from the user's input is used directly as the judgment information and compared with the multiple optimal solutions, so that the output result is obtained faster. For example, if the user inputs "Chaoyang weather", then among the multiple Chaoyang districts in the output, the selection is made by the geographical information identifier recognized from the user's input. This approach simplifies the recognition mode, but it is applicable only under the condition that the multiple optimal solutions all point to the same kind of personal identification information (such as geographical information).
It is also possible that only one candidate recognition output is produced in step S03. In that case, the processing of step S05 can be bypassed. In some other examples, however, step S05 can still be used to determine whether the single candidate recognition output of step S03 is suitable, to discard a clearly unsuitable recognition output, and to prompt the user to input speech again.
After the optimal recognition output is determined, in step S06, the optimal recognition output is output. The output mode adopted here may include, but is not limited to, sound, image, text, or any other mode of outputting information used in this field; the present invention places no limitation on this.
The description of the technical scheme above uses the user's geographical location information as the example of personal identification information, but other personal identification information can also be used, for example the user's native-place information.
When the user's native-place information is used, the dialect and accent of the user's speech input can be judged so as to determine the user's native-place information. Fig. 2 provides a flowchart of a method of performing speech recognition using the user's native-place information according to an embodiment of the present invention.
When the user's speech input is obtained in step S01 shown in Fig. 2, the dialect and/or accent attributes of the user can be recognized from the acquired user speech, so as to judge the user's native-place information (step S07).
After the native-place information is obtained, it is used in step S05 as the user's personal identification information to judge the optimal output result.
For example, in step S02 the speech recognition system can judge the dialect attribute of the speech; the judgment result may be, for instance, "Zhejiang dialect". Then, in step S05, the "Zhejiang dialect" attribute can be used as the decision condition to further judge the multiple candidate optimal results selected in step S03. For example, when the candidate optimal results to be judged are "I want to listen to Shaoxing opera" and "I want to listen to Guangdong opera", the decision condition "Zhejiang dialect" determines the final output result to be "I want to listen to Shaoxing opera".
Performing the judgment with native-place information as the user's personal identification information can avoid the misjudgment caused by obtaining the geographical information identifier through device association — for example, the error that might arise when a user whose native place is Zhejiang is currently in Guangdong while using the above speech recognition apparatus.
The cases of using geographical location information and native-place information as the user's personal identification information have been described above with reference to Fig. 1 and Fig. 2, respectively. In some examples, however, the two cases can be combined to obtain a more accurate judgment result. For example, the user's native-place information and geographical location information can be combined and used together as the personal identification information.
A third embodiment combines the above two embodiments: the judgment based on native-place information and the judgment based on geographical location information are both performed, the two judgment results are compared, and the comparison result is used as the judgment identification information in step S05. For example, when the two judgment results are identical (e.g., both are Zhejiang), that result is used as the judgment identification information. In other embodiments, if the two judgment results differ, a higher priority can be given to the native-place judgment or the geographical-location judgment according to a default or a user setting. In still other embodiments, when more kinds of personal identification information are available, they can be combined in the judgment: for example, different weights are assigned to the different identification information, and the judgment result with the highest total score is selected. The technical scheme of the present invention can adopt any other method of judging over multiple kinds of personal identification information that will readily occur to those skilled in the art, which is not repeated here.
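The weighted combination mentioned above can be sketched as a score vote; the signal names and weights below are illustrative assumptions, not values from the patent:

```python
from collections import defaultdict

def weighted_decision(signals):
    """Each signal is a (judged_value, weight) pair; return the value
    with the highest total weight, e.g. 'Zhejiang' vs 'Guangdong'."""
    totals = defaultdict(float)
    for value, weight in signals:
        totals[value] += weight
    return max(totals, key=totals.get)

# Hypothetical signals: the dialect judgment and a stored native-place
# record both say Zhejiang, while the current position says Guangdong.
signals = [
    ("Zhejiang", 0.6),   # dialect/accent judgment
    ("Guangdong", 0.3),  # current geographical position
    ("Zhejiang", 0.5),   # stored native-place record
]
```

Here the two Zhejiang signals outvote the current position (1.1 vs 0.3), matching the patent's example of a Zhejiang native temporarily located in Guangdong.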
The above examples use the user's personal identification information only in the judgment of step S05, but in some examples the user's personal identification information can also be used in the speech recognition of step S02. Fig. 3 shows a flowchart of another speech recognition method according to an embodiment of the present invention.
As shown in Fig. 3, in step S01, the speech input of the user is obtained.
In a subsequent step, the personal identification information of the user is detected. For example, the user's native-place information can be detected from the user's speech input, or the user's geographical location information can be detected by other means; the present invention places no limitation on this. As noted above, this detection step can be performed at any time before the personal identification information is used (in this example, before step S02). In some cases, previously obtained and stored personal identification information can even be used.
Then, in the speech recognition step S02, the personal identification information is used as the criterion for database selection, to speed up the speech recognition. In the subsequent steps, data recognition proceeds in the same way as before, and in step S05 the personal identification information is used again to perform the small-sample judgment and finally obtain the output data precisely.
The personal identification information is thus used twice in the above example. Its first use acts on the selection of the speech judgment database (for example, a specific geographical information identifier selects the database used in speech recognition), and its second use selects the appropriate judgment output among the candidate optimal results. Even when a suitable speech database has been selected, inappropriate output information can still appear according to the probability combinations; therefore, the user's personal identification information (such as native-place information, or the geographical information identifier described above) can be used to screen the optimal results.
Fig. 4 is a schematic block diagram of a speech recognition apparatus for implementing the above speech recognition method according to an embodiment of the present invention. As shown in Fig. 4, the speech recognition apparatus may include: a speech obtaining unit 410 for obtaining the speech input of the user; a speech recognition unit 420 for selecting a speech database to recognize the speech input by the user and outputting the recognition outputs as results; a first judgment unit 430 for using domain judgment to select one or more candidate optimal recognition outputs from the recognition outputs; and a second judgment unit 440 for determining the optimal recognition output among the one or more candidate optimal recognition outputs using the user's personal identification information as the decision condition.
In some examples, the speech recognition unit 420 is further operable to recognize the speech input by the user according to the selected speech database, using the acoustic model and language model of the speech recognition engine.
In some examples, the speech recognition apparatus may further include an information detection unit 450 for detecting the personal identification information of the user.
In some examples, the speech recognition apparatus may further include a memory 460 for storing the personal identification information detected by the information detection unit 450. The memory may also store any data used by the speech recognition apparatus when performing speech recognition, such as the speech databases described above; the present invention places no limitation on this.
The personal identification information of the user described in the present invention may include one or more of the user's geographical location information, the current connection signal source of the mobile device used by the user, and the user's native place. As noted above, the personal identification information in the present invention is not limited to these, and may be any information known in this field that identifies a user in a personalized way.
In some examples, the information detection unit 450 is further operable to obtain the user's native place by recognizing the dialect and/or accent attributes of the user while recognizing the speech input by the user.
In some examples, the speech recognition unit 420 is further operable to select the speech database for speech recognition using the user's personal identification information.
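The unit structure of Fig. 4 can be sketched as a class whose members mirror units 410–460; all of the internals below are illustrative stand-ins, not the patent's implementation:

```python
class SpeechRecognitionApparatus:
    """Sketch of the Fig. 4 units: recognition (420), domain judgment
    (430), personal-info judgment (440), and a memory (460) that can
    hold previously detected personal info (450)."""

    def __init__(self, database, domain_filter, personal_judge):
        self.database = database              # unit 420's database
        self.domain_filter = domain_filter    # unit 430
        self.personal_judge = personal_judge  # unit 440
        self.memory = {}                      # memory 460

    def run(self, speech, user_id):
        info = self.memory.get(user_id)       # reuse stored info, if any
        outputs = self.database(speech)       # unit 420
        candidates = self.domain_filter(outputs)
        return self.personal_judge(candidates, info)

apparatus = SpeechRecognitionApparatus(
    database=lambda s: ["opera A", "opera B"],
    domain_filter=lambda outs: outs,
    personal_judge=lambda cands, info: next(
        (c for c in cands if info and info in c), cands[0]),
)
apparatus.memory["u1"] = "B"  # info detected on a previous use
```

With stored info for user "u1" the second judgment picks "opera B"; for an unknown user it falls back to the first domain-judgment candidate.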
The speech recognition apparatus according to embodiments of the present invention has been described above in the form of a schematic block diagram of modules/units. It should be noted, however, that one or more of these modules/units can be realized by one or more pieces of specific hardware. Moreover, Fig. 4 is merely a schematic block diagram used to explain the technical scheme of the present invention; an actual realization may include more or fewer modules/units. For example, some implementations may also include an output device for outputting information, such as a speaker or a display, and some implementations may also include various storage devices for storing the data/programs required by, or produced in, realizing the technical scheme of the present invention. The present invention is not limited in this respect.
Fig. 5 shows a schematic block diagram of a speech recognition system according to an embodiment of the present invention. As shown in Fig. 5, the speech recognition system includes the speech recognition cloud server (or simply the speech recognition apparatus) shown in Fig. 4 and a client intelligent voice device (or simply the client device) communicatively connected with the speech recognition apparatus. As noted above, when the user is co-located with the speech recognition apparatus, the client device may be omitted, and the user can input speech directly at the speech recognition apparatus.
The speech recognition processing of the speech recognition apparatus shown in Fig. 5 is identical to the processes described with reference to Figs. 1, 2 and 3, and is not repeated here.
Furthermore, it should be noted that the technical schemes described in the embodiments of the present invention can be combined in any manner as long as they do not conflict.
In the several embodiments provided by the present invention, it should be understood that the disclosed method and apparatus may be realized in other ways. The apparatus embodiments described above are only schematic; for example, the division of the units is only a division by logical function, and other divisions are possible in an actual realization: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional units in the various embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a unit individually, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above description is provided only for implementing the embodiments of the present invention. Those skilled in the art should understand that any modification or partial replacement that does not depart from the scope of the present invention shall fall within the scope defined by the claims; therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (13)
1. A speech recognition method, comprising:
obtaining a voice input of a user;
selecting a speech database to recognize the voice input by the user, and outputting resulting recognition outputs;
using domain judgment to select one or more candidate optimal recognition outputs from the recognition outputs; and
using personalized identification information of the user as a decision condition to determine an optimal recognition output among the one or more candidate optimal recognition outputs.
2. The speech recognition method according to claim 1, wherein said selecting a speech database to recognize the voice input by the user comprises:
recognizing the voice input by the user, according to the selected speech database, by using an acoustic model and a language model of a speech recognition engine.
3. The speech recognition method according to claim 1, further comprising:
detecting the personalized identification information of the user.
4. The speech recognition method according to claim 3, wherein the personalized identification information of the user includes one or more of: geographical location information of the user, a signal source to which a mobile device used by the user is currently connected, and a native place of the user.
5. The speech recognition method according to claim 4, wherein the native place of the user is obtained by identifying a dialect and/or accent attribute of the user during the recognition of the voice input by the user.
6. The speech recognition method according to claim 1, further comprising:
using the personalized identification information of the user to select the speech database for speech recognition.
7. A speech recognition apparatus, comprising:
a voice obtaining unit configured to obtain a voice input of a user;
a voice recognition unit configured to select a speech database to recognize the voice input by the user and to output resulting recognition outputs;
a first judging unit configured to use domain judgment to select one or more candidate optimal recognition outputs from the recognition outputs; and
a second judging unit configured to use personalized identification information of the user as a decision condition to determine an optimal recognition output among the one or more candidate optimal recognition outputs.
8. The speech recognition apparatus according to claim 7, wherein the voice recognition unit is further configured to:
recognize the voice input by the user, according to the selected speech database, by using an acoustic model and a language model of a speech recognition engine.
9. The speech recognition apparatus according to claim 7, further comprising:
an information detecting unit configured to detect the personalized identification information of the user.
10. The speech recognition apparatus according to claim 9, wherein the personalized identification information of the user includes one or more of: geographical location information of the user, a signal source to which a mobile device used by the user is currently connected, and a native place of the user.
11. The speech recognition apparatus according to claim 10, wherein the information detecting unit is further configured to obtain the native place of the user by identifying a dialect and/or accent attribute of the user during the recognition of the voice input by the user.
12. The speech recognition apparatus according to claim 7, wherein the voice recognition unit is further configured to:
use the personalized identification information of the user to select the speech database for speech recognition.
13. A speech recognition system, comprising:
the speech recognition apparatus according to any one of claims 7 to 12; and
a client device communicatively connected with the speech recognition apparatus.
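Not part of the patent text: a minimal sketch, under stated assumptions, of the four steps of claim 1 — obtain voice input, recognize against a selected speech database, narrow the hypotheses by domain judgment, then decide the single optimal output using the user's personalized identification information. The helper names, example hypotheses, keyword lists and scoring heuristic are all hypothetical, not the patented implementation.

```python
# Illustrative pipeline for claim 1; every name and heuristic here is invented.

def recognize_with_database(audio, database):
    """Step 2: produce recognition hypotheses from the selected speech database.
    A real engine would apply the database's acoustic and language models (claim 2)."""
    return [("play Peking opera", 0.40),
            ("play Beijing opera excerpts", 0.35),
            ("play baking program", 0.25)]

def domain_filter(hypotheses, active_domain):
    """Step 3: domain judgment keeps the candidate optimal recognition outputs."""
    domain_keywords = {"music": ("opera", "song"),
                       "navigation": ("traffic", "route")}
    keywords = domain_keywords[active_domain]
    return [h for h in hypotheses if any(k in h[0] for k in keywords)]

def personalized_decision(candidates, profile):
    """Step 4: personalized identification information (e.g. native place,
    location) acts as the decision condition for the single optimal output."""
    def score(hyp):
        text, engine_score = hyp
        # Hypothetical heuristic: boost hypotheses matching the user's native place.
        bonus = 0.2 if profile.get("native_place", "") in text else 0.0
        return engine_score + bonus
    return max(candidates, key=score)[0]

profile = {"native_place": "Beijing", "location": "Haidian"}   # step 1 context
hyps = recognize_with_database(b"...", database="mandarin")     # step 2
candidates = domain_filter(hyps, active_domain="music")         # step 3
best = personalized_decision(candidates, profile)               # step 4
```

In this toy run the domain filter keeps two "opera" candidates, and the native-place bonus flips the decision toward the Beijing-related hypothesis even though the engine scored it lower, which is the role claim 1 assigns to the personalized decision condition.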
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610375073.8A CN105931642B (en) | 2016-05-31 | 2016-05-31 | Voice recognition method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105931642A true CN105931642A (en) | 2016-09-07 |
CN105931642B CN105931642B (en) | 2020-11-10 |
Family
ID=56832830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610375073.8A Active CN105931642B (en) | 2016-05-31 | 2016-05-31 | Voice recognition method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105931642B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107464115A (en) * | 2017-07-20 | 2017-12-12 | 北京小米移动软件有限公司 | personal characteristic information verification method and device |
CN107785021A (en) * | 2017-08-02 | 2018-03-09 | 上海壹账通金融科技有限公司 | Pronunciation inputting method, device, computer equipment and medium |
CN108206020A (en) * | 2016-12-16 | 2018-06-26 | 北京智能管家科技有限公司 | A kind of audio recognition method, device and terminal device |
CN109101475A (en) * | 2017-06-20 | 2018-12-28 | 北京嘀嘀无限科技发展有限公司 | Trip audio recognition method, system and computer equipment |
CN110517660A (en) * | 2019-08-22 | 2019-11-29 | 珠海格力电器股份有限公司 | Noise reduction method and device based on embedded Linux real-time kernel |
CN111353091A (en) * | 2018-12-24 | 2020-06-30 | 北京三星通信技术研究有限公司 | Information processing method and device, electronic equipment and readable storage medium |
CN112363631A (en) * | 2019-07-24 | 2021-02-12 | 北京搜狗科技发展有限公司 | Input method, input device and input device |
US11302313B2 (en) | 2017-06-15 | 2022-04-12 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for speech recognition |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1842842A (en) * | 2003-08-29 | 2006-10-04 | 松下电器产业株式会社 | Method and apparatus for improved speech recognition with supplementary information |
CN103037117A (en) * | 2011-09-29 | 2013-04-10 | 中国电信股份有限公司 | Method and system of voice recognition and voice access platform |
CN103578469A (en) * | 2012-08-08 | 2014-02-12 | 百度在线网络技术(北京)有限公司 | Method and device for showing voice recognition result |
CN103903611A (en) * | 2012-12-24 | 2014-07-02 | 联想(北京)有限公司 | Speech information identifying method and equipment |
CN103956169A (en) * | 2014-04-17 | 2014-07-30 | 北京搜狗科技发展有限公司 | Speech input method, device and system |
CN105070288A (en) * | 2015-07-02 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Vehicle-mounted voice instruction recognition method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104836720B (en) * | 2014-02-12 | 2022-02-25 | 北京三星通信技术研究有限公司 | Method and device for information recommendation in interactive communication |
- 2016-05-31: CN application CN201610375073.8A filed; granted as CN105931642B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN105931642B (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105931642A (en) | Speech recognition method, apparatus and system | |
CN107210033B (en) | Updating language understanding classifier models for digital personal assistants based on crowd sourcing | |
US20230072352A1 (en) | Speech Recognition Method and Apparatus, Terminal, and Storage Medium | |
US11600259B2 (en) | Voice synthesis method, apparatus, device and storage medium | |
US10032454B2 (en) | Speaker and call characteristic sensitive open voice search | |
CN102782751B (en) | Digital media voice tags in social networks | |
CN103974109B (en) | Speech recognition apparatus and for providing the method for response message | |
CN110415679A (en) | Voice error correction method, device, equipment and storage medium | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
CN104335160A (en) | Function execution instruction system, function execution instruction method, and function execution instruction program | |
CN112530408A (en) | Method, apparatus, electronic device, and medium for recognizing speech | |
CN103474065A (en) | Method for determining and recognizing voice intentions based on automatic classification technology | |
CN106959999A (en) | Voice search method and device | |
CN111128134A (en) | Acoustic model training method, voice awakening method, device and electronic equipment | |
WO2022135496A1 (en) | Voice interaction data processing method and device | |
CN104252464A (en) | Information processing method and information processing device | |
CN106649410B (en) | Method and device for obtaining chat reply content | |
CN102497481A (en) | Method, device and system for voice dialing | |
CN109710949A (en) | A kind of interpretation method and translator | |
CN105609105A (en) | Speech recognition system and speech recognition method | |
CN110085217A (en) | Phonetic navigation method, device and terminal device | |
KR102312993B1 (en) | Method and apparatus for implementing interactive message using artificial neural network | |
CN107885845B (en) | Audio classification method and device, computer equipment and storage medium | |
CN109637536A (en) | A kind of method and device of automatic identification semantic accuracy | |
CN111625636A (en) | Man-machine conversation refusal identification method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20190312. Address after: 8th Floor, 76 Zhichun Road, Haidian District, Beijing 100086. Applicants after: Beijing Jingdong Shangke Information Technology Co., Ltd.; iFlytek Co., Ltd. Address before: Room C-301, 3rd Floor, Building 2, 20 Suzhou Street, Haidian District, Beijing 100080. Applicant before: Beijing Linglong Technology Co., Ltd. |
GR01 | Patent grant | |