CN106875949A - Speech recognition correction method and device - Google Patents

Speech recognition correction method and device

Info

Publication number
CN106875949A
CN106875949A (application CN201710291330.4A, granted as CN106875949B)
Authority
CN
China
Prior art keywords
speech recognition
current application
application scene
result
language material
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710291330.4A
Other languages
Chinese (zh)
Other versions
CN106875949B (en)
Inventor
石日俭
贺磊
刘旭
吕晓霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SZBROAD TECHNOLOGY Co Ltd
SSK Corp
Original Assignee
SZBROAD TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SZBROAD TECHNOLOGY Co Ltd
Priority to CN201710291330.4A
Publication of CN106875949A
Application granted
Publication of CN106875949B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a speech recognition correction method and device. The method includes: determining the current application scene of the user according to detection data from a configured detection device; performing speech recognition on sound detected under the current application scene; performing deep learning, based on the deep learning model corresponding to the current application scene, on the corpus material obtained by speech recognition, to obtain a learning result; and correcting the result of speech recognition according to the learning result. The embodiment of the invention can meet the requirements of speech recognition in specific application scenes, performs speech recognition in a targeted way for each application scene, greatly improves the accuracy of speech recognition, thereby promotes human-computer interaction, and has a wide range of applications.

Description

Speech recognition correction method and device
Technical field
The present invention relates to speech processing technology, and in particular to a speech recognition correction method and device.
Background technology
With the development of science and technology, mankind has entered the era of artificial intelligence. Artificial intelligence extends human wisdom and abilities by simulating human thought processes and intelligent behaviour, enabling machines to perform complex work that normally requires human intelligence. Important branches of artificial intelligence include speech recognition, text translation and speech synthesis. Speech recognition technology transforms an input voice signal into the corresponding text through machine recognition and understanding, realizing human-computer communication; text translation technology arranges the words obtained by speech recognition into sentences according to correct syntax; speech synthesis technology (Text to Speech, TTS) converts text information generated by a machine or input from outside into speech resembling human expression and outputs it.
At present, the speech recognition technologies developed by companies such as iFlytek, Microsoft and Google are based on big-data platforms with huge cloud data processing capabilities. Their data are large in volume and broad in coverage, and they can basically realize human-machine language interaction; however, the recognition and translation of specific sentences under a specific application scene are often not accurate enough.
Correction methods in the prior art generally use statistical or machine-learning methods to progressively filter out a correction set. However, this approach lacks targeting: the correction process is substantially identical for every user's input, so the accuracy of the correction is not high. For example, when the voice "lihua" of different users is received and the corresponding text obtained by initial recognition is "Li Hua", it may in every case be corrected to "pear blossom", "physics and chemistry" or "fireworks display" without regard to the different application scenes; that is, the correction result is not obtained in a targeted way.
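The ambiguity described above can be made concrete with a small sketch. This is not part of the patent; the homophone mapping and scene names are invented for illustration only. It shows why the same syllables have no single correct reading until a scene label makes the choice principled.

```python
# Toy illustration of the background problem: without an application scene,
# the homophone "lihua" has no single correct reading. This mapping is
# invented for illustration only.
HOMOPHONES = {
    "lihua": {
        "garden": "pear blossom",            # lihua read as 梨花
        "school": "physics and chemistry",   # lihua read as 理化
        "festival": "fireworks display",     # lihua read as 礼花
    },
}

def disambiguate(pinyin: str, scene: str) -> str:
    """Pick the reading of a homophone that matches the current scene."""
    readings = HOMOPHONES.get(pinyin, {})
    # Fall back to an arbitrary reading (or the raw input) when the scene is
    # unknown -- exactly the untargeted behaviour the prior art exhibits.
    return readings.get(scene, next(iter(readings.values()), pinyin))

print(disambiguate("lihua", "school"))  # physics and chemistry
```

With a scene label the lookup is direct; without one, any fixed fallback is wrong for most users, which is the prior-art weakness the embodiments below address.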
Summary of the invention
The embodiment of the invention provides a speech recognition correction method and device, to solve the prior-art problem that the correction result of speech recognition output is inaccurate.
In a first aspect, an embodiment of the invention provides a speech recognition correction method, including:
determining the current application scene of the user according to detection data from a configured detection device;
performing speech recognition on sound detected under the current application scene;
performing deep learning, based on the deep learning model corresponding to the current application scene, on the corpus material obtained by speech recognition, to obtain a learning result;
correcting the result of speech recognition according to the learning result.
Further, determining the current application scene of the user according to the detection data of the configured detection device includes at least one of the following:
performing speech recognition on the detected sound, and judging the application scene corresponding to the corpus to which the recognized material belongs;
detecting the position of the mobile terminal through a locating module, to obtain the current application scene of the user;
detecting features of the application scene through a Bluetooth digital signal processing device, and determining the current application scene according to the features.
Further, before determining the current application scene of the user according to the detection data of the configured detection device, the method also includes:
grouping the corpus under each application scene using a clustering algorithm, and extracting corpus features according to the result of the grouping;
training on the corpus features to create a deep learning model corresponding to each application scene.
Further, correcting the result of speech recognition according to the learning result includes:
if the learning result is that the result of the speech recognition does not match the current application scene, correcting the speech recognition result to the corresponding result under the current application scene.
Further, the corpus includes: stored material input by users, screened material, and/or material obtained by correcting the results of speech recognition.
In a second aspect, an embodiment of the invention additionally provides a speech recognition correction device, including:
a scene determining module, for determining the current application scene of the user according to detection data from a configured detection device;
a speech recognition module, for performing speech recognition on sound detected under the current application scene;
a deep learning module, for performing deep learning, based on the deep learning model corresponding to the current application scene, on the corpus material obtained by speech recognition, to obtain a learning result;
a correction module, for correcting the result of speech recognition according to the learning result.
Further, the scene determining module includes:
a first determining unit, for performing speech recognition on the detected sound and judging the application scene corresponding to the corpus to which the recognized material belongs;
a second determining unit, for detecting the position of the mobile terminal through a locating module, to obtain the current application scene of the user;
a third determining unit, for detecting features of the application scene through a Bluetooth digital signal processing device and determining the current application scene according to the features.
Further, the device also includes:
a feature extraction unit, for grouping the corpus under each application scene using a clustering algorithm and extracting corpus features according to the result of the grouping;
a model creating unit, for training on the corpus features to create a deep learning model corresponding to each application scene.
Further, the correction module includes:
a correction unit, for correcting the speech recognition result to the corresponding result under the current application scene if the learning result is that the result of the speech recognition does not match the current application scene.
Further, the corpus includes:
stored material input by users, screened material, and/or material obtained by correcting the results of speech recognition.
The embodiment of the invention provides a speech recognition correction method and device: the current application scene is determined from the obtained detection data; the corpus material obtained by speech recognition undergoes deep learning in the deep learning model corresponding to the current application scene; speech recognition results that do not match the current application scene are corrected and replaced with the correct text translation. This meets the requirements of speech recognition in specific application scenes, performs speech recognition in a targeted way for each application scene, greatly improves the accuracy of speech recognition, thereby promotes human-computer interaction so that people and machines can communicate effectively, improves the user experience, and has a wide range of applications.
Brief description of the drawings
Fig. 1 is a flow chart of the speech recognition correction method in embodiment one of the present invention;
Fig. 2 is a flow chart of the speech recognition correction method in embodiment two of the present invention;
Fig. 3a is a flow chart of the speech recognition correction method in embodiment three of the present invention;
Fig. 3b is a schematic diagram of the speech recognition correction method in embodiment three of the present invention;
Fig. 4 is a flow chart of the speech recognition correction method in embodiment four of the present invention;
Fig. 5 is a structural schematic diagram of the speech recognition correction device in embodiment five of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the accompanying drawings illustrate only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a flow chart of a speech recognition correction method provided by embodiment one of the present invention. This embodiment is applicable to the situation where the result of speech recognition is corrected according to the current application scene. The method can be performed by a speech recognition correction device, which can be realized in software and/or hardware and is typically integrated in equipment with a speech recognition function.
The method of embodiment one of the present invention specifically includes:
S101: determining the current application scene of the user according to detection data from a configured detection device.
The Chinese language is extensive and profound, and speech recognition of Chinese is rather difficult: even a single difference of tone, or the same tone spoken with a different intonation, can make the intended meaning completely different. It is therefore necessary to detect the current application scene of the user and, according to the different application scenes, to recognize and judge using the corpus material under the specific application scene the user is in, so that the final result of speech recognition is more accurate. A configured detection device can detect the current application environment and thereby determine the current application scene of the user.
S102: performing speech recognition on the sound detected under the current application scene.
Specifically, after the current application scene of the user is determined, speech recognition is performed on the detected sound to obtain the speech recognition result, i.e. the corpus material obtained by speech recognition.
S103: performing deep learning, based on the deep learning model corresponding to the current application scene, on the material obtained by speech recognition, to obtain a learning result.
Specifically, the deep learning model corresponding to each application scene is created first, establishing a neural network that simulates how the human brain analyses and learns. Deep learning and analysis are performed on the material obtained by speech recognition, covering semantics, speech, intonation, context and grammar, to judge whether the preliminary result of speech recognition matches the current application scene, i.e. whether the material obtained by speech recognition is accurate.
S104: correcting the result of speech recognition according to the learning result.
Specifically, after the deep learning, if the material obtained by speech recognition is inaccurate, the result of speech recognition is corrected: it is translated into the correct text, which replaces the previous speech recognition result.
In this embodiment, the current application scene of the user is determined first; combined with the current application scene, deep learning is performed on the material obtained by speech recognition; if that material is inaccurate, the result of speech recognition is corrected according to the result of the deep learning and the current application scene. For example: the material input by the user is "the programmer writes code in front of the computer", but, perhaps because the user's accent is non-standard or the speaking rate is too fast, the recognition result of the big-data speech engine is "the programmer writes aunt in front of the computer". From vocabulary such as "programmer" and "computer", the current application scene can be determined to be a programmer's working scene; deep learning is performed on the recognition result of the big-data speech engine in the deep learning model, "writes aunt" is corrected to "writes code", and the correct speech recognition result is obtained.
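The four steps S101-S104 can be sketched as follows. This is a minimal stand-in, not the patent's implementation: the scene vocabularies, the correction table and the overlap-counting scene detector are all invented for illustration, and a trivial string replacement stands in for the deep learning model.

```python
# Hypothetical scene vocabularies and per-scene correction tables.
SCENE_VOCAB = {
    "programming": {"programmer", "computer", "code"},
    "restaurant": {"menu", "dish", "waiter"},
}
SCENE_CORRECTIONS = {
    "programming": {"writes aunt": "writes code"},
}

def detect_scene(words):
    # S101: choose the scene whose vocabulary overlaps the input the most.
    return max(SCENE_VOCAB, key=lambda s: len(SCENE_VOCAB[s] & set(words)))

def correct(transcript: str) -> str:
    words = transcript.split()             # S102: the raw recognition result
    scene = detect_scene(words)            # S101: current application scene
    # S103/S104: judge and fix fragments that mismatch the scene.
    for wrong, right in SCENE_CORRECTIONS.get(scene, {}).items():
        transcript = transcript.replace(wrong, right)
    return transcript

print(correct("programmer writes aunt before computer"))
# programmer writes code before computer
```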
The speech recognition correction method provided by embodiment one of the present invention can meet the requirements of speech recognition in specific application scenes, performs speech recognition in a targeted way for each application scene, greatly improves the accuracy of speech recognition, thereby promotes human-computer interaction so that people and machines can communicate effectively, improves the user experience, and has a wide range of applications.
Embodiment two
Fig. 2 is a flow chart of a speech recognition correction method provided by embodiment two of the present invention. Embodiment two is optimized on the basis of embodiment one, further refining the operation of determining the current application scene of the user according to the detection data of the configured detection device. As shown in Fig. 2, embodiment two of the present invention specifically includes:
S201: performing speech recognition on the detected sound, and judging the application scene corresponding to the corpus to which the recognized material belongs.
Specifically, corpora having mapping relations with each application scene are collected and stored; a corpus is the set of all the collected material. According to the material input by the user, speech recognition is performed on the detected sound and the result is compared with the contents of the corpora, to search for and judge the current application scene corresponding to the corpus to which the recognized material belongs. The keywords of a specific application scene can be collected and mapping relations established between those keywords and their application scene. For example, material such as all the common expressions and menu names of the restaurant scene is collected, and a mapping relation between this material and the restaurant application scene is established.
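A minimal sketch of such a keyword-to-scene mapping, assuming hand-collected per-scene corpora (the phrases below are invented; the restaurant entries echo the common-expressions-and-menu-names example): an inverted index from words to scenes lets the recognized material vote for its scene.

```python
# Hypothetical per-scene corpora, standing in for collected scene material.
SCENE_CORPORA = {
    "restaurant": ["the menu please", "one kung pao chicken"],
    "hospital": ["refill my prescription", "which ward"],
}

# Build an inverted index: word -> set of scenes whose corpus contains it.
INDEX = {}
for scene, phrases in SCENE_CORPORA.items():
    for phrase in phrases:
        for word in phrase.split():
            INDEX.setdefault(word, set()).add(scene)

def scene_of(utterance: str):
    """Vote for the scene sharing the most words with the utterance."""
    votes = {}
    for w in utterance.split():
        for s in INDEX.get(w, ()):
            votes[s] = votes.get(s, 0) + 1
    return max(votes, key=votes.get) if votes else None

print(scene_of("show me the menu"))  # restaurant
```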
S202: detecting the position of the mobile terminal through a locating module, to obtain the current application scene of the user.
Specifically, the position of the user can be detected by a module with a positioning function in the mobile terminal the user is using, and the current application scene of the user is determined according to the detection result. The module with a positioning function can locate the current application scene through positioning methods such as the Global Positioning System (GPS), Bluetooth positioning technology, or map-software positioning over mobile data or a wireless local area network.
S203: detecting features of the application scene through a Bluetooth digital signal processing device, and determining the current application scene according to the features.
Specifically, signals of the current application scene are collected using the sensors in the Bluetooth digital signal processing device, and the features of the application scene are detected from the collected signals. For example, whether the environment is indoor or outdoor can be judged from the temperature detected by a temperature sensor, and the current application scene of the user is determined on that basis.
In this embodiment, the Global Positioning System can be used to locate the position of the user. For example: if the user is located in a restaurant, the current application scene can be judged to be a restaurant, and the result of speech recognition should then be related to the restaurant scene.
It is worth explaining that the above three methods all serve to determine the current application scene; any one, any two, or all of them can be selected according to the actual situation.
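A sketch of how the three routes might be combined, under the assumption that each detector either names a scene or abstains; a simple majority vote stands in for whatever fusion the actual implementation uses.

```python
def combine_detectors(*detectors):
    """Majority vote over detector outputs, ignoring abstentions (None)."""
    votes = {}
    for detect in detectors:
        scene = detect()
        if scene is not None:
            votes[scene] = votes.get(scene, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Stand-ins for the corpus-match, locating-module and sensor detectors.
scene = combine_detectors(
    lambda: "restaurant",  # corpus keyword match
    lambda: "restaurant",  # GPS / map-software position
    lambda: None,          # Bluetooth sensor detector abstains
)
print(scene)  # restaurant
```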
S204: performing speech recognition on the sound detected under the current application scene.
S205: performing deep learning, based on the deep learning model corresponding to the current application scene, on the material obtained by speech recognition, to obtain a learning result.
S206: correcting the result of speech recognition according to the learning result.
The speech recognition correction method provided by embodiment two of the present invention can accurately obtain the current application scene of the user and performs speech recognition in a targeted way according to that scene, improving the accuracy of speech recognition and the user's actual interactive experience with the product.
Embodiment three
Fig. 3 a are a kind of flow chart of the bearing calibration of speech recognition that the embodiment of the present invention three is provided, the embodiment of the present invention Three are optimized improvement based on the various embodiments described above, residing for determining user according to the detection data of setting testing equipment Current application scene before operation further illustrated, as shown in Figure 3 a, the method for the embodiment of the present invention three is specific Including:
S301, the corpus under each application scenarios is grouped using clustering algorithm, according to the result of the packet Extract language material feature.
Preferably, the corpus includes:The language material of the user input for having stored, the language material by screening and/or correction The language material that the result of speech recognition is obtained.
Specifically, corpus is used as the basic data in deep learning model, can be the user input for having stored Language material, and/or specialty voice technology business according to the language material screened by all kinds of topics, and/or to voice identification result Carry out phonetic synthesis, the language material that analysis and correction phonetic synthesis result are obtained.Use the clustering algorithms pair such as partitioning or stratification Corpus is grouped, and extracts every group of feature of language material.
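A toy stand-in for the grouping step: the text names partitioning and hierarchical clustering, but even a single-pass grouping by word overlap (Jaccard similarity, with an invented threshold) shows the shape of S301, with each group's shared words serving as its extracted feature. All phrases are illustrative.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two phrases."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def group_corpus(phrases, threshold=0.2):
    """Greedy single-pass grouping: join the first group similar enough."""
    groups = []
    for p in phrases:
        for g in groups:
            if jaccard(p, g[0]) >= threshold:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def group_features(group):
    """Feature of a group: the words common to every phrase in it."""
    common = set(group[0].split())
    for p in group[1:]:
        common &= set(p.split())
    return common

groups = group_corpus([
    "the menu please", "bring the menu",      # restaurant-flavoured phrases
    "write some code", "review this code",    # programming-flavoured phrases
])
print(len(groups), sorted(group_features(groups[1])))  # 2 ['code']
```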
S302: training on the corpus features to create a deep learning model corresponding to each application scene.
Specifically, the corpus is input into a model and the features of the material are trained through a neural network, simulating the thinking mode of the human brain, to create a deep learning model for each application scene. For each piece of material, the accuracy of its speech recognition result is judged in combination with its application scene.
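The training step can be miniaturized as follows. Counting unigram frequencies per scene is of course nothing like the neural-network training described here, but it is enough to score how well a recognition result fits each scene, which is the judgment S302's model is built for. The corpora are invented.

```python
from collections import Counter

def train(scene_corpora):
    """'Train' one unigram-frequency model per application scene."""
    return {scene: Counter(w for phrase in phrases for w in phrase.split())
            for scene, phrases in scene_corpora.items()}

def best_scene(models, utterance):
    """Scene whose model gives the utterance's words the highest total count."""
    words = utterance.split()
    return max(models, key=lambda s: sum(models[s][w] for w in words))

models = train({
    "restaurant": ["the menu please", "one more dish"],
    "programming": ["write the code", "debug the code"],
})
print(best_scene(models, "show the code"))  # programming
```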
S303: determining the current application scene of the user according to detection data from the configured detection device.
S304: performing speech recognition on the sound detected under the current application scene.
S305: performing deep learning, based on the deep learning model corresponding to the current application scene, on the material obtained by speech recognition, to obtain a learning result.
S306: correcting the result of speech recognition according to the learning result.
In the present embodiment, Fig. 3 b are a kind of schematic diagram of the bearing calibration of speech recognition that the embodiment of the present invention three is provided, With reference to Fig. 3 b, the positioning function of the mobile terminal that can be used by user, bluetooth digital signal processing appts and search defeated The matching application scenarios for entering language material obtain the current geographic position of user jointly, determine the current application scene at user. Classification language material that the user's language material that to store, voice technology business provide and the language material after being corrected to phonetic synthesis result Input to model is trained, and creates the deep learning model of each application scenarios of correspondence.By the voice of big data speech engine The result of identification is input into deep learning model, and according to current application scene, the result to speech recognition carries out error correction, and right Fallibility point is predicted, and the result to the speech recognition of mistake is corrected, and inherited error is replaced with correct translation result Translation result.
The speech recognition correction method provided by embodiment three of the present invention makes the identification of the current application scene more accurate by creating the deep learning model, so that the accuracy of speech recognition results can be judged, inaccurate results are corrected, and the accuracy of speech recognition is improved.
Example IV
Fig. 4 is a flow chart of a speech recognition correction method provided by embodiment four of the present invention. Embodiment four is an optimized improvement on the basis of the above embodiments, further explaining the operation of correcting the result of speech recognition according to the learning result. As shown in Fig. 4, the method of embodiment four of the present invention specifically includes:
S401: determining the current application scene of the user according to detection data from the configured detection device.
S402: performing speech recognition on the sound detected under the current application scene.
S403: performing deep learning, based on the deep learning model corresponding to the current application scene, on the material obtained by speech recognition, to obtain a learning result.
S404: if the learning result is that the result of the speech recognition does not match the current application scene, correcting the speech recognition result to the corresponding result under the current application scene.
Specifically, whether the speech recognition result output by the big-data speech engine matches the current application scene is verified. If they do not match, the result of speech recognition is corrected to the result matching the current application scene and translated into the correct text, replacing the originally erroneous result.
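The mismatch-then-replace logic of S404 can be sketched with the standard library; `difflib` fuzzy matching against an invented per-scene phrase list stands in for the learned model's choice of the "corresponding result under the current application scene".

```python
import difflib

# Hypothetical corpus of phrases valid under the programming scene.
SCENE_PHRASES = {
    "programming": ["write code", "review code", "debug the build"],
}

def correct_for_scene(result: str, scene: str, cutoff: float = 0.6) -> str:
    """Replace a scene-mismatched result with the closest in-scene phrase."""
    candidates = SCENE_PHRASES.get(scene, [])
    match = difflib.get_close_matches(result, candidates, n=1, cutoff=cutoff)
    return match[0] if match else result  # keep the original if nothing is close

print(correct_for_scene("write coke", "programming"))  # write code
```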
The speech recognition correction method provided by embodiment four of the present invention corrects speech recognition results that do not match the application scene, improving the accuracy of speech recognition and translation under specific application scenes and optimizing the system logic.
Embodiment five
Fig. 5 is a structural schematic diagram of a speech recognition correction device in embodiment five of the present invention. The device is applied to correcting speech recognition results that do not match the application scene. As shown in Fig. 5, the device includes: a scene determining module 501, a speech recognition module 502, a deep learning module 503 and a correction module 504.
The scene determining module 501 is used for determining the current application scene of the user according to detection data from a configured detection device;
the speech recognition module 502 is used for performing speech recognition on sound detected under the current application scene;
the deep learning module 503 is used for performing deep learning, based on the deep learning model corresponding to the current application scene, on the material obtained by speech recognition, to obtain a learning result;
the correction module 504 is used for correcting the result of speech recognition according to the learning result.
Embodiment five of the present invention determines the current application scene by obtaining detection data; the material obtained by speech recognition undergoes deep learning in the deep learning model corresponding to the current application scene; speech recognition results that do not match the current application scene are corrected and replaced with the correct text translation. This can meet the requirements of speech recognition in specific application scenes, performs speech recognition in a targeted way for each application scene, greatly improves the accuracy of speech recognition, thereby promotes human-computer interaction so that people and machines can communicate effectively, improves the user experience, and has a wide range of applications.
On the basis of the various embodiments described above, the scene determining module 501 can include:
First determining unit, for carrying out speech recognition to the sound for detecting, judges that speech recognition is obtained belonging to language material The corresponding application scenarios of corpus;
Second determining unit, the position where for detecting mobile terminal by locating module obtains working as residing for user Preceding application scenarios;
3rd determining unit, the feature for detecting application scenarios by bluetooth digital signal processing appts, according to described Feature determines current application scene.
On the basis of the various embodiments described above, described device can also include:
Feature extraction unit, for being grouped to the corpus under each application scenarios using clustering algorithm, according to institute The result for stating packet extracts language material feature;
Model creating unit, for being trained to the language material feature, creates the depth of each application scenarios of correspondence Practise model.
On the basis of the above embodiments, the correction module 504 may include:
a correction unit, configured to, if the learning result indicates that the result of the speech recognition does not match the current application scenario, correct the output result of the speech recognition to the corresponding result under the current application scenario.
On the basis of the above embodiments, the corpus may include:
stored user-input language material, screened language material, and/or language material obtained by correcting speech recognition results.
In this embodiment, the scene determining module determines the current application scenario of the user through the first determining unit, which searches for the application scenario matching the input language material; the second determining unit, which locates the geographical position of the user; and the third determining unit, which detects application scenario features. The speech recognition module recognizes the sound detected in the current application scenario and obtains a recognition result. The stored user-input language material, and/or the language material screened from various topics by professional speech technology providers, and/or the language material obtained by synthesizing, analyzing, and correcting speech recognition results, are input into the model as the basic data of the corpus for training, creating a deep learning model corresponding to each application scenario. In the deep learning module, deep learning is performed on the language material obtained by speech recognition, based on the deep learning model corresponding to the current application scenario. If the learning result indicates that the recognition result does not match the current application scenario, the correction unit of the correction module corrects the recognition result, translates it into the correct text, and replaces the original transcription.
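The overall flow just described (recognize, check the result against the current scenario, replace a mismatched transcription) can be sketched end to end. All tables and function names below are hypothetical: a vocabulary check stands in for the deep learning model, and a per-scenario replacement table stands in for the corrected transcription it would produce.

```python
# End-to-end sketch of the correction flow. The vocabulary check and the
# replacement table are illustrative stand-ins for the scenario-specific
# deep learning model described in the patent.

# Per-scenario replacement table: recognized text -> corrected text.
SCENE_CORRECTIONS = {
    "station": {"by a tick it": "buy a ticket"},
}

SCENE_VOCAB = {
    "station": {"buy", "a", "ticket", "platform"},
}

def matches_scene(text, scene):
    # Stand-in for the deep-learning check: the recognition result matches
    # the scenario if every word belongs to the scenario vocabulary.
    vocab = SCENE_VOCAB.get(scene, set())
    return all(w in vocab for w in text.split())

def correct(text, scene):
    # If the recognition result does not match the current scenario,
    # replace it with the corresponding result under that scenario;
    # otherwise keep the original transcription.
    if matches_scene(text, scene):
        return text
    return SCENE_CORRECTIONS.get(scene, {}).get(text, text)
```

A matching result passes through unchanged, while a mismatched homophone string is replaced by the scenario-appropriate text, mirroring the "replace the original translation result" step above.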
The speech recognition correction device provided in Embodiment 5 of the present invention improves the accuracy of speech recognition and promotes effective human-machine communication; at the same time, by improving the logic of the speech recognition system, it can be applied in a wide range of scenarios.
The speech recognition correction device provided in the embodiments of the present invention can execute the speech recognition correction method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing the method.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A correction method for speech recognition, characterized by comprising:
determining a current application scenario of a user according to detection data of a preset detection device;
performing speech recognition on a sound detected in the current application scenario;
performing deep learning, based on a deep learning model corresponding to the current application scenario, on language material obtained by the speech recognition, to obtain a learning result;
correcting a result of the speech recognition according to the learning result.
2. The method according to claim 1, characterized in that determining the current application scenario of the user according to the detection data of the preset detection device comprises at least one of the following:
performing speech recognition on the detected sound, and determining the application scenario corresponding to the corpus to which the recognized language material belongs;
detecting, through a positioning module, a position of a mobile terminal, to obtain the current application scenario of the user;
detecting features of an application scenario through a Bluetooth digital signal processing device, and determining the current application scenario according to the features.
3. The method according to claim 1, characterized in that before determining the current application scenario of the user according to the detection data of the preset detection device, the method further comprises:
grouping the corpus under each application scenario using a clustering algorithm, and extracting language-material features according to a result of the grouping;
training the language-material features, and creating a deep learning model corresponding to each application scenario.
4. The method according to claim 1, characterized in that correcting the result of the speech recognition according to the learning result comprises:
if the learning result indicates that the result of the speech recognition does not match the current application scenario, correcting an output result of the speech recognition to a corresponding result under the current application scenario.
5. The method according to claim 3, characterized in that the corpus comprises: stored user-input language material, screened language material, and/or language material obtained by correcting speech recognition results.
6. A correction device for speech recognition, characterized by comprising:
a scene determining module, configured to determine a current application scenario of a user according to detection data of a preset detection device;
a speech recognition module, configured to perform speech recognition on a sound detected in the current application scenario;
a deep learning module, configured to perform deep learning, based on a deep learning model corresponding to the current application scenario, on language material obtained by the speech recognition, to obtain a learning result;
a correction module, configured to correct a result of the speech recognition according to the learning result.
7. The device according to claim 6, characterized in that the scene determining module comprises:
a first determining unit, configured to perform speech recognition on the detected sound, and determine the application scenario corresponding to the corpus to which the recognized language material belongs;
a second determining unit, configured to detect a position of a mobile terminal through a positioning module, to obtain the current application scenario of the user;
a third determining unit, configured to detect features of an application scenario through a Bluetooth digital signal processing device, and determine the current application scenario according to the features.
8. The device according to claim 6, characterized in that the device further comprises:
a feature extraction unit, configured to group the corpus under each application scenario using a clustering algorithm, and extract language-material features according to a result of the grouping;
a model creation unit, configured to train the language-material features, and create a deep learning model corresponding to each application scenario.
9. The device according to claim 6, characterized in that the correction module comprises:
a correction unit, configured to, if the learning result indicates that the result of the speech recognition does not match the current application scenario, correct an output result of the speech recognition to a corresponding result under the current application scenario.
10. The device according to claim 8, characterized in that the corpus comprises:
stored user-input language material, screened language material, and/or language material obtained by correcting speech recognition results.
CN201710291330.4A 2017-04-28 2017-04-28 Correction method and device for voice recognition Active CN106875949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710291330.4A CN106875949B (en) 2017-04-28 2017-04-28 Correction method and device for voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710291330.4A CN106875949B (en) 2017-04-28 2017-04-28 Correction method and device for voice recognition

Publications (2)

Publication Number Publication Date
CN106875949A true CN106875949A (en) 2017-06-20
CN106875949B CN106875949B (en) 2020-09-22

Family

ID=59161656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710291330.4A Active CN106875949B (en) 2017-04-28 2017-04-28 Correction method and device for voice recognition

Country Status (1)

Country Link
CN (1) CN106875949B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293296A (en) * 2017-06-28 2017-10-24 百度在线网络技术(北京)有限公司 Speech recognition result correction method, device, equipment and storage medium
CN107680600A (en) * 2017-09-11 2018-02-09 平安科技(深圳)有限公司 Voiceprint model training method, speech recognition method, device, equipment and medium
CN108831505A (en) * 2018-05-30 2018-11-16 百度在线网络技术(北京)有限公司 Method and apparatus for identifying a usage scenario of an application
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A system for improving the intent detection accuracy and recall rate of outbound-call robots
CN109410913A (en) * 2018-12-13 2019-03-01 百度在线网络技术(北京)有限公司 A speech synthesis method, device, equipment and storage medium
CN110544234A (en) * 2019-07-30 2019-12-06 北京达佳互联信息技术有限公司 Image noise detection method, image noise detection device, electronic equipment and storage medium
CN110556127A (en) * 2019-09-24 2019-12-10 北京声智科技有限公司 method, device, equipment and medium for detecting voice recognition result
CN111104546A (en) * 2019-12-03 2020-05-05 珠海格力电器股份有限公司 Method and device for constructing corpus, computing equipment and storage medium
CN111368145A (en) * 2018-12-26 2020-07-03 沈阳新松机器人自动化股份有限公司 Knowledge graph creating method and system and terminal equipment
CN111951626A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Language learning apparatus, method, medium, and computing device
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN114155841A (en) * 2021-11-15 2022-03-08 安徽听见科技有限公司 Voice recognition method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0661688A2 (en) * 1993-12-30 1995-07-05 International Business Machines Corporation System and method for location specific speech recognition
CN1282072A (en) * 1999-07-27 2001-01-31 国际商业机器公司 Error correction method for speech recognition results and speech recognition system
CN1356628A (en) * 2000-07-05 2002-07-03 国际商业机器公司 Speech recognition correction for devices with limited or no display
CN1555553A (en) * 2001-09-17 2004-12-15 Koninklijke Philips Electronics N.V. Correcting a text recognized by speech recognition through comparison of phonetic sequences in the recognized text with a phonetic transcription of a manually input correction word
CN102324233A (en) * 2011-08-03 2012-01-18 中国科学院计算技术研究所 Method for automatically correcting identification error of repeated words in Chinese pronunciation identification
CN103645876A (en) * 2013-12-06 2014-03-19 百度在线网络技术(北京)有限公司 Voice inputting method and device
CN103903619A (en) * 2012-12-28 2014-07-02 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of speech recognition
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN105447019A (en) * 2014-08-20 2016-03-30 北京羽扇智信息科技有限公司 User usage scene based input identification result calibration method and system
CN105786880A (en) * 2014-12-24 2016-07-20 中兴通讯股份有限公司 Voice recognition method, client and terminal device


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293296A (en) * 2017-06-28 2017-10-24 百度在线网络技术(北京)有限公司 Speech recognition result correction method, device, equipment and storage medium
CN107293296B (en) * 2017-06-28 2020-11-20 百度在线网络技术(北京)有限公司 Speech recognition result correction method, device, equipment and storage medium
CN107680600A (en) * 2017-09-11 2018-02-09 平安科技(深圳)有限公司 Voiceprint model training method, speech recognition method, device, equipment and medium
WO2019047343A1 (en) * 2017-09-11 2019-03-14 平安科技(深圳)有限公司 Voiceprint model training method, voice recognition method, device and equipment and medium
CN108831505A (en) * 2018-05-30 2018-11-16 百度在线网络技术(北京)有限公司 Method and apparatus for identifying a usage scenario of an application
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A system for improving the intent detection accuracy and recall rate of outbound-call robots
US11264006B2 (en) 2018-12-13 2022-03-01 Baidu Online Network Technology (Beijing) Co., Ltd. Voice synthesis method, device and apparatus, as well as non-volatile storage medium
CN109410913A (en) * 2018-12-13 2019-03-01 百度在线网络技术(北京)有限公司 A speech synthesis method, device, equipment and storage medium
CN111368145A (en) * 2018-12-26 2020-07-03 沈阳新松机器人自动化股份有限公司 Knowledge graph creating method and system and terminal equipment
CN111951626A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Language learning apparatus, method, medium, and computing device
CN110544234A (en) * 2019-07-30 2019-12-06 北京达佳互联信息技术有限公司 Image noise detection method, image noise detection device, electronic equipment and storage medium
CN110556127A (en) * 2019-09-24 2019-12-10 北京声智科技有限公司 method, device, equipment and medium for detecting voice recognition result
CN111104546A (en) * 2019-12-03 2020-05-05 珠海格力电器股份有限公司 Method and device for constructing corpus, computing equipment and storage medium
CN111104546B (en) * 2019-12-03 2021-08-27 珠海格力电器股份有限公司 Method and device for constructing corpus, computing equipment and storage medium
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN114155841A (en) * 2021-11-15 2022-03-08 安徽听见科技有限公司 Voice recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106875949B (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN106875949A (en) A kind of bearing calibration of speech recognition and device
CN107330011B (en) The recognition methods of the name entity of more strategy fusions and device
CN102243871B (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
CN112329467B (en) Address recognition method and device, electronic equipment and storage medium
CN111488468B (en) Geographic information knowledge point extraction method and device, storage medium and computer equipment
CN109920414A (en) Nan-machine interrogation's method, apparatus, equipment and storage medium
US20100299138A1 (en) Apparatus and method for language expression using context and intent awareness
US20090228277A1 (en) Search Aided Voice Recognition
CN106503231B (en) Search method and device based on artificial intelligence
CN104143331B (en) A kind of method and system adding punctuate
CN104360994A (en) Natural language understanding method and natural language understanding system
CN104376065B (en) The determination method and apparatus of term importance
CN106570180A (en) Artificial intelligence based voice searching method and device
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
CN103956169A (en) Speech input method, device and system
CN103853738A (en) Identification method for webpage information related region
CN106649253B (en) Auxiliary control method and system based on rear verifying
CN109213856A (en) Semantic recognition method and system
CN110674423A (en) Address positioning method and device, readable storage medium and electronic equipment
CN103914455B (en) A kind of interest point search method and device
Lefevre et al. Cross-lingual spoken language understanding from unaligned data using discriminative classification models and machine translation.
CN109710949A (en) A kind of interpretation method and translator
CN110287405A (en) The method, apparatus and storage medium of sentiment analysis
CN106297765A (en) Phoneme synthesizing method and system
CN103246648A (en) Voice input control method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant