CN109119073A - Audio recognition method, system, speaker and storage medium based on multi-source identification - Google Patents


Info

Publication number
CN109119073A
CN109119073A (application number CN201810673599.3A)
Authority
CN
China
Prior art keywords
recognition
speech
sound box
intelligent sound
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810673599.3A
Other languages
Chinese (zh)
Inventor
蔡洁荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FLYBALL ELECTRONIC (SHENZHEN) Co Ltd
Original Assignee
FLYBALL ELECTRONIC (SHENZHEN) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FLYBALL ELECTRONIC (SHENZHEN) Co Ltd filed Critical FLYBALL ELECTRONIC (SHENZHEN) Co Ltd
Priority to CN201810673599.3A priority Critical patent/CN109119073A/en
Publication of CN109119073A publication Critical patent/CN109119073A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics

Abstract

The invention discloses an audio recognition method, system, speaker, and storage medium based on multi-source identification. The method comprises: acquiring user speech through a smart speaker; the smart speaker submitting the acquired user speech to at least two speech recognition platforms for recognition, obtaining at least two recognition results; the smart speaker obtaining the at least two recognition results and comparing the results produced by the at least two speech recognition platforms; the smart speaker outputting the at least two recognition results when they are identical; and the smart speaker reconciling the at least two recognition results when they differ, then outputting the reconciled result. By providing at least two speech recognition platforms in the smart speaker to recognize user speech, outputting directly when the recognition results agree, and reconciling differing results into a final recognition result before output, the present invention greatly improves the speech recognition accuracy of the smart speaker.

Description

Audio recognition method, system, speaker and storage medium based on multi-source identification
Technical field
The present invention relates to the field of speech recognition, and in particular to an audio recognition method, system, speaker, and storage medium based on multi-source identification.
Background technique
Speech recognition is a key technology for human-computer interaction that allows a machine to recognize spoken user commands. It can markedly improve interaction, letting a user complete more tasks simply by speaking a command. Speech recognition is realized through a speech recognition engine obtained by online or offline training. The speech recognition process can generally be divided into a training stage and a recognition stage. In the training stage, an acoustic model (AM) and a lexicon are statistically derived from training data according to the mathematical model on which the speech recognition engine is based. In the recognition stage, the engine processes the input speech using the acoustic model and lexicon to obtain a speech recognition result. For example, feature extraction is performed on the spectrogram of the input sound to obtain feature vectors; a phoneme sequence (such as [i], [o], etc.) is then obtained from the acoustic model; finally, the words, or even sentences, that best match the phoneme sequence are located in the lexicon.
A speech recognition system may load more than one speech recognition engine to recognize the same speech simultaneously. For example, a first engine may be a speaker-dependent automatic speech recognition (SD-ASR) engine, trained to recognize speech from a specific speaker and to output a recognition result with a corresponding score. A second engine may be a speaker-independent automatic speech recognition (SI-ASR) engine, able to recognize speech from any user and likewise output a recognition result with a corresponding score.
Beyond human-computer interaction, speech recognition is also applied in social software, where user speech is converted to text for output. Whether for human-computer interaction or social applications, improving the accuracy of speech recognition remains a challenge.
Summary of the invention
The purpose of the present invention is to address the above drawbacks of the prior art by providing an audio recognition method, system, speaker, and storage medium based on multi-source identification.
The technical solution adopted by the present invention is to provide an audio recognition method based on multi-source identification, the method comprising:
acquiring user speech through a smart speaker;
the smart speaker submitting the acquired user speech to at least two speech recognition platforms for recognition, obtaining at least two recognition results;
the smart speaker obtaining the at least two recognition results and comparing the at least two recognition results produced by the at least two speech recognition platforms;
the smart speaker outputting the at least two recognition results when they are identical;
the smart speaker reconciling the at least two recognition results when they differ, then outputting the reconciled result.
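As an illustrative sketch (not part of the patent text), the claimed steps can be expressed in Python; the platform objects below are hypothetical callables standing in for real recognition services:

```python
def recognize_multi_source(audio, platforms, reconcile):
    """Run the audio through at least two recognition platforms.
    Output directly when all results agree; otherwise reconcile first."""
    assert len(platforms) >= 2, "the method requires at least two platforms"
    results = [p(audio) for p in platforms]   # one result per platform
    if all(r == results[0] for r in results):
        return results[0]                     # identical results: output as-is
    return reconcile(results)                 # differing results: reconcile
```

The `reconcile` callable corresponds to any of the reconciliation embodiments described later (semantic model, second engine, or fuzzy search).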
Preferably, before the smart speaker submits the acquired user speech to the at least two speech recognition platforms to obtain the at least two recognition results, the method further comprises:
providing in the smart speaker speech recognition platforms with at least two different recognition strategies as the at least two speech recognition platforms;
collecting and storing the voiceprint of the user through the smart speaker;
denoising the acquired user speech.
Recognizing the user speech with at least two speech recognition platforms improves speech recognition accuracy, and selecting platforms with at least two different recognition strategies as the at least two platforms that recognize the user speech makes the resulting accuracy, under different recognition strategies, more reliable. Collecting and storing the user's voiceprint and using it as a recognition sample yields higher recognition accuracy. Denoising the user speech makes the sound source easier to recognize and likewise improves accuracy.
Preferably, the smart speaker reconciling the differing at least two recognition results and then outputting comprises:
the smart speaker distinguishing the differing segment and applying context semantic analysis to the differing segment;
calling a cloud-computed convolutional neural training model to compute the semantics of the at least two recognition results, and determining one of them to output as the recognition result.
The at least two recognition results obtained by recognizing the user speech through the at least two speech recognition platforms are not necessarily identical; when the at least two recognition results differ, it cannot be determined which result to output. Calling the convolutional neural training model in the cloud to compute the semantics of the at least two recognition results yields, for output, the recognition result that best fits the semantic conventions of the semantic library. Because the result selected by the model computation fits those semantic conventions, recognition accuracy is improved.
Preferably, the smart speaker reconciling the differing at least two recognition results and then outputting comprises:
selecting at least one second speech recognition engine corresponding to the at least two speech recognition platforms to recognize the user speech again, obtaining multiple second recognition results;
comparing the multiple recognition results with the multiple second recognition results;
selecting the recognition result with the highest agreement rate for output.
For differing recognition results, recognizing again with a second speech engine increases the number of recognitions and improves recognition accuracy.
Preferably, the smart speaker reconciling the differing at least two recognition results and then outputting comprises:
distinguishing the differing segment and performing a fuzzy search on the differing segment;
selecting the recognition result with the highest fuzzy-search match for output.
By fuzzy-searching the differing segment and replacing it with the best-matching content found, content that fits semantic conventions, recognition accuracy is likewise improved.
A speech recognition system based on multi-source identification is also provided, the system comprising:
an input module, arranged in the smart speaker, for acquiring user speech;
at least two speech recognition modules, arranged in the smart speaker, for recognizing the user speech to obtain at least two recognition results;
a comparison module, arranged in the smart speaker, for comparing the at least two recognition results produced by the at least two speech recognition modules;
a reconciliation module, arranged in the smart speaker, for reconciling the at least two recognition results when they differ;
an output module, arranged in the smart speaker, for outputting the reconciled at least two recognition results.
Preferably, the at least two speech recognition modules are speech recognition modules with at least two different recognition strategies, each speech recognition module comprising:
a storage submodule, for storing the collected voiceprint of the user;
a denoising submodule, for denoising the acquired user speech.
Preferably, the reconciliation module comprises:
a cloud computing submodule, for analyzing the context semantics of the differing segment and calling the cloud convolutional neural training model to compute the semantics of the at least two recognition results;
a search submodule, for fuzzy-searching the differing segment;
at least one second speech recognition submodule, arranged on the speech recognition module, for recognizing the user speech again to obtain multiple second recognition results.
A smart speaker is also provided. The smart speaker comprises a processor and a memory in which at least one instruction, at least one program, code set, or instruction set is stored; the at least one instruction, at least one program, code set, or instruction set is loaded and executed by the processor to implement the aforementioned audio recognition method based on multi-source identification.
A computer-readable storage medium is also provided. At least one instruction, at least one program, code set, or instruction set is stored in the storage medium; the at least one instruction, at least one program, code set, or instruction set is loaded and executed by a processor to implement the aforementioned audio recognition method based on multi-source identification.
Compared with the prior art, the present invention at least has the following benefits: by providing at least two speech recognition platforms in the smart speaker to recognize user speech, outputting when the recognition results are identical, and reconciling differing results into a final recognition result before output, the present invention greatly improves the speech recognition accuracy of the smart speaker.
Detailed description of the invention
Fig. 1 is a flowchart of the audio recognition method based on multi-source identification according to the embodiment of the present invention;
Fig. 2 is a flowchart of one reconciliation process according to the embodiment of the present invention;
Fig. 3 is a flowchart of another reconciliation process according to the embodiment of the present invention;
Fig. 4 is a flowchart of yet another reconciliation process according to the embodiment of the present invention;
Fig. 5 is a module diagram of the speech recognition system based on multi-source identification according to the embodiment of the present invention.
Specific embodiment
The present invention will be further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the invention proposes an audio recognition method based on multi-source identification, implemented in a speech recognition environment that includes a terminal. The terminal may be a smart speaker, smartphone, tablet computer, laptop, desktop computer, or the like; the present invention places no specific restriction on the product type of the terminal. The terminal is equipped with a social or human-computer interaction application, and that application can call the terminal's built-in microphone and display device.
In the embodiments of the present invention, the environment is preferably a smart speaker. The smart speaker is provided with a microphone capable of capturing speech and a display screen, and is further provided with a speech recognition function module. Of course, as one possible implementation environment, the smart speaker may also connect over the network to voice platforms that provide speech recognition services.
The described method includes:
11. Acquiring user speech. Specifically, user speech is acquired through a sound pickup device arranged in the smart speaker; the sound pickup device may be a microphone or other equipment capable of sound collection.
Further, in order to obtain a better sound source, a denoising device may be arranged in the sound pickup device to denoise at the source and improve source quality, thereby reducing factors that interfere with speech recognition. The user speech may be converted into an audio signal by a voice converter.
Further, the audio signal may be converted into a digital signal and output to the speech recognition platforms.
12. Multi-platform speech recognition. The smart speaker submits the acquired user speech to at least two speech recognition platforms for recognition, obtaining at least two recognition results. In this step, the acquired user speech is sent to different speech recognition platforms for recognition, the speech recognition results of the multiple platforms are obtained, and those results are then compared and judged.
Specifically, multi-platform speech recognition may be realized within the smart speaker: multiple groups of speech recognition engines may be built into the smart speaker, each configured with its own semantic library. It should be noted that among the multiple engine groups built into the smart speaker, each group uses a different recognition strategy, and the semantic combination strategies of the configured semantic libraries also differ. For example, the engines may differ in the feature extraction module: speech feature vectors may be extracted using Mel-frequency cepstral coefficients or using perceptual linear prediction coefficients. They may also differ in how the acoustic model is built, for example using a hidden Markov model with Gaussian mixture models, a convolutional neural network, or a deep neural network. As for the semantic combination strategies of the semantic libraries, the difference may lie in the grammatical emphasis: some semantic libraries emphasize verb tense, some emphasize distinguishing near-synonyms or homophones, and some emphasize the completeness of the syntactic structure.
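As an illustrative sketch (not part of the patent text), the requirement that each engine group use a distinct recognition strategy could be modeled as a small registry; the strategy names below are hypothetical stand-ins for real engine configurations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineStrategy:
    """One engine group's recognition strategy (names are illustrative)."""
    feature_extractor: str   # e.g. "mfcc" or "plp"
    acoustic_model: str      # e.g. "hmm-gmm", "cnn", "dnn"
    semantic_emphasis: str   # e.g. "verb-tense", "homophones", "syntax"

# At least two engine groups, each with a different strategy, as the method requires.
ENGINE_GROUPS = [
    EngineStrategy("mfcc", "hmm-gmm", "verb-tense"),
    EngineStrategy("plp", "cnn", "homophones"),
    EngineStrategy("mfcc", "dnn", "syntax"),
]

def strategies_are_distinct(groups):
    """Verify that no two engine groups share the same full strategy."""
    return len(set(groups)) == len(groups)
```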
As one possible embodiment, among the multiple engine groups built into the smart speaker and their configured semantic libraries, a speech recognition engine may use multiple semantic libraries to obtain speech recognition results; that is, one group of speech recognition engines may obtain multiple speech recognition results through multiple semantic libraries.
Of course, as another possible embodiment, among the multiple engine groups built into the smart speaker and their configured semantic libraries, multiple engine groups may also share one semantic library to obtain multiple speech recognition results.
In addition, as yet another possible embodiment, among the multiple engine groups built into the smart speaker and their configured semantic libraries, each time speech is recognized, one or more semantic libraries may be randomly matched to an engine group to obtain the speech recognition results.
The speech recognition results may be output as digital signals for ease of comparison.
13. Checking whether the speech recognition results are identical. The smart speaker obtains the at least two recognition results and compares the at least two recognition results produced by the at least two speech recognition platforms. If the speech recognition results of the multiple platforms are identical, the speech recognition is accurate, and the method proceeds to step 15 for output. Each speech recognition platform comprises a speech recognition engine and a semantic library; the speech recognition engine is composed of a voice recognition chip and its circuitry, and matches the speech signal against the semantics in the semantic library to obtain the semantics that match the speech signal.
14. Reconciling the differing at least two recognition results before output. If the speech recognition results of the multiple platforms differ, the cause may be interference at the sound source that prevents the platforms from recognizing accurately, or it may be the platforms' differing recognition strategies. In this case, the differing recognition results must be reconciled to determine the final recognition result before output.
15. Outputting the identical at least two recognition results. A digital-to-analog signal converter is further provided in the terminal; the digital signals recognized by the speech recognition platforms are converted to analog signals for output, making the recognition result better fit people's reading habits.
In the embodiments of the present invention, before the user speech is recognized by the at least two speech recognition platforms to obtain the at least two recognition results, the method further comprises:
selecting speech recognition platforms with at least two different recognition strategies as the at least two speech recognition platforms. Because the speech recognition results are obtained under different recognition strategies, with different speech recognition engines and differently configured semantic libraries, the accuracy of the results is assured, effectively improving the accuracy of the final speech recognition result.
Collecting and storing the voiceprint of the user. A voiceprint is the sound-wave spectrum, displayed by electroacoustic instruments, that carries speech information. A voiceprint is not only specific but also relatively stable: after adulthood, a person's voice remains relatively stable for a long time. Whether the speaker deliberately imitates another person's voice and tone or whispers softly, even if the imitation is vivid and lifelike, the voiceprint remains distinct. Voiceprint characteristics allow the speech recognition platforms to capture the user's voice band more easily, improving the accuracy of speech recognition.
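As an illustrative sketch, matching incoming speech against a stored voiceprint is often done by comparing feature vectors; the cosine-similarity check and the threshold below are assumptions for illustration, not the patent's specified method:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matches_stored_voiceprint(stored, candidate, threshold=0.8):
    """Accept the candidate only if it is close enough to the enrolled voiceprint."""
    return cosine_similarity(stored, candidate) >= threshold
```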
Denoising the acquired user speech. The user speech collected from the terminal's built-in microphone, as the speech to be recognized, can pick up noise from the external environment during acquisition. This noise interferes with the speech to be recognized, affecting the accuracy of speech recognition and degrading it.
Specifically, the speech to be recognized may be denoised to obtain a better sound source. Ambient sound may be collected in advance and converted into a digital audio signal by an analog-to-digital converter; this signal may serve as a reference signal, and when the speech to be recognized is denoised, this component is eliminated.
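A minimal sketch of the reference-signal idea described above: the pre-captured ambient signal is subtracted from the speech signal sample by sample. This is only an illustration; real devices would typically use adaptive filtering:

```python
def denoise_with_reference(speech, ambient_reference):
    """Subtract a pre-captured ambient reference signal from the speech
    signal, sample by sample, to eliminate the noise component."""
    cleaned = list(speech)
    for i, r in enumerate(ambient_reference[:len(cleaned)]):
        cleaned[i] -= r
    return cleaned
```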
Recognizing the user speech with at least two speech recognition platforms improves speech recognition accuracy, and selecting platforms with at least two different recognition strategies as the at least two platforms that recognize the user speech makes the resulting accuracy, under different recognition strategies, more reliable. Collecting and storing the user's voiceprint and using it as a recognition sample yields higher recognition accuracy. Denoising the user speech makes the sound source easier to recognize and likewise improves accuracy.
As shown in Fig. 2, in the embodiments of the present invention, reconciling the differing at least two recognition results before output comprises the steps of:
21. Distinguishing the differing segment and applying context semantic analysis to the differing segment. Setting the differing segment aside, the identical portions before and after the differing segment of the recognition results are matched in the semantic library to obtain similar semantics.
22. Model training. The cloud convolutional neural training model is called to compute the semantics of the at least two recognition results.
Before the cloud convolutional neural training model is called for computation, the model needs to be trained so that the convolutional neural network can rapidly compute a predicted value for the semantics of the recognition result. The predicted value replaces the differing segment in the recognition results and, together with the identical parts of the recognition results, forms the determined recognition result.
23. Determining the recognition result: one of the results is determined as the recognition result and output. After conversion by the digital-to-analog converter, the determined recognition result may be output to the display device of the terminal; of course, it may also be output directly, without conversion to an analog signal, to form an instruction to the terminal.
The at least two recognition results obtained by recognizing the user speech through the at least two speech recognition platforms are not necessarily identical; when the at least two recognition results differ, it cannot be determined which result to output. Calling the convolutional neural training model in the cloud to compute the semantics of the at least two recognition results yields, for output, the recognition result that best fits the semantic conventions of the semantic library. Because the result selected by the model computation fits those semantic conventions, recognition accuracy is improved.
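The replacement of the differing segment can be sketched as follows; the `predict` callable stands in for the trained convolutional model, which is not reproduced here:

```python
def split_difference(a, b):
    """Split two token sequences into (common prefix, the two differing
    middles, common suffix)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0
    while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[:i], (a[i:len(a) - j], b[i:len(b) - j]), a[len(a) - j:]

def reconcile(a, b, predict):
    """Replace the differing segment with the model's predicted tokens,
    keeping the identical parts of the two recognition results."""
    prefix, middles, suffix = split_difference(a, b)
    return prefix + predict(prefix, middles, suffix) + suffix
```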
As shown in Fig. 3, as one embodiment of the invention, reconciling the differing at least two recognition results before output comprises the following steps:
31. Second speech recognition engine. At least one second speech recognition engine corresponding to the at least two speech recognition platforms is selected to recognize the user speech again, obtaining multiple second recognition results. Without changing the semantic library, placing the second speech recognition engine on the same speech recognition platform can improve speech recognition accuracy and reduce the physical effects introduced by the speech recognition engine.
32. Comparing recognition results. The multiple recognition results are compared with the multiple second recognition results. Among the multiple results, the result with the fewest differing segments, that is, the result with the highest agreement rate, is found by comparison; semantic analysis is then applied to the result with the highest agreement rate to obtain the determined recognition result.
33. Determining the recognition result: the result with the highest agreement rate is selected for output. After conversion by the digital-to-analog converter, the determined recognition result may be output to the display device of the terminal; of course, it may also be output directly, without conversion to an analog signal, to form an instruction to the terminal.
For differing recognition results, recognizing again with the second speech engine increases the number of recognitions and improves recognition accuracy.
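The agreement-rate selection of steps 31 to 33 can be sketched as a simple pooled vote; this is an illustration under the assumption that results are compared as whole strings:

```python
from collections import Counter

def pick_by_agreement(first_results, second_results):
    """Pool the first- and second-pass recognition results and return the
    hypothesis with the highest agreement rate (produced most often)."""
    counts = Counter(first_results + second_results)
    return counts.most_common(1)[0][0]
```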
As shown in Fig. 4, as another embodiment of the invention, reconciling the differing at least two recognition results before output comprises:
41. Distinguishing the differing segment: the identical semantics before and after the differing segment are taken out as keywords.
42. Fuzzy search: the differing segment is fuzzy-searched and replaced. Searching by keyword, and taking the candidate that returns the most search results as the standard, the differing segment is fuzzily replaced, and the recognition result that best fits semantic conventions is obtained as the determined recognition result.
43. Determining the recognition result: the result with the highest fuzzy-search match is selected for output. After conversion by the digital-to-analog converter, the determined recognition result may be output to the display device of the terminal; of course, it may also be output directly, without conversion to an analog signal, to form an instruction to the terminal.
By fuzzy-searching the differing segment and replacing it with the best-matching content found, content that fits semantic conventions, the accuracy of speech recognition is likewise improved.
As one possible embodiment, in order to obtain fuzzy-search results faster, a retrieval semantic library may be configured for the smart speaker, storing the semantic fields used for speech recognition.
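The fuzzy-search replacement of steps 41 to 43 can be sketched as follows; `search_count` is a stand-in for a real search backend, which the patent does not specify:

```python
def fuzzy_replace(prefix, candidates, suffix, search_count):
    """For each candidate filling the differing segment, query a search
    backend with the surrounding keywords and keep the candidate whose
    query returns the most results (an assumption of this sketch)."""
    def hits(candidate):
        return search_count(" ".join(prefix + [candidate] + suffix))
    best = max(candidates, key=hits)
    return prefix + [best] + suffix
```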
As shown in Fig. 5, a speech recognition system based on multi-source identification is also provided. The system is applied to a terminal, which may be a smartphone, tablet computer, laptop, desktop computer, or the like; the present invention places no specific restriction on the product type of the terminal. The terminal is equipped with a social or human-computer interaction application, and that application can call the terminal's built-in microphone and display device.
The system comprises:
An input module 51, for acquiring user speech; the input module 51 is the microphone built into the terminal.
At least two speech recognition modules 52, for recognizing the user speech to obtain at least two recognition results. The speech recognition module 53 is a voice recognition chip arranged in the application cloud; the terminal is provided with an analog-to-digital converter that converts the user speech into an audio signal.
The audio signal may also be converted into a communication signal and uploaded to the cloud for recognition.
Of course, the speech recognition module 53 may also be a voice recognition chip arranged in the terminal.
A comparison module 54, for comparing the at least two recognition results obtained by the at least two speech recognition modules 52; the comparison module 54 is a processing chip for processing data.
A reconciliation module 55, for reconciling the differing at least two recognition results; the reconciliation module 55 reconciles the recognition results at the digital level.
An output module 56, for outputting the reconciled at least two recognition results. The output module 56 is arranged in the terminal, and a digital-to-analog signal converter is further provided in the terminal; the digital signals recognized by the speech recognition platforms are converted to analog signals for output, making the recognition result better fit people's reading habits.
In the embodiments of the present invention, the at least two speech recognition modules 52 are speech recognition modules 53 with at least two different recognition strategies; the speech recognition module 53 comprises:
a storage submodule, for storing the collected voiceprint of the user; the storage submodule is arranged in the application cloud and stores the user's voiceprint in the cloud;
a denoising submodule, for denoising the acquired user speech. The denoising submodule may be arranged in the application cloud and digitally denoises the user speech.
In embodiments of the present invention, the same module 55 includes:
a cloud computing submodule, for analyzing the context semantics of the differing sections, which calls a convolutional neural network model trained in the cloud to compute the semantics of the at least two recognition results. Specifically, before the cloud convolutional neural network model is called, the model must be trained so that it can rapidly compute a predicted value for the semantics of a recognition result. This predicted value replaces the differing section of the recognition results and, together with the sections that are identical across the recognition results, composes the determined recognition result.
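A hedged sketch of this reconciliation flow follows. The trained cloud convolutional network is stood in for by a toy scoring function, so only the control flow (score each candidate's semantics, keep the best) mirrors the description; every name and the tiny vocabulary are illustrative assumptions:

```python
# Toy stand-in for the cloud model's semantic predicted value: score a
# candidate by the fraction of its words found in a small in-domain
# vocabulary. A real system would query the trained CNN in the cloud.
def semantic_score(sentence: str) -> float:
    vocabulary = {"turn", "on", "the", "living", "room", "light"}
    words = sentence.split()
    return sum(w in vocabulary for w in words) / len(words)

def reconcile_by_semantics(candidates: list[str]) -> str:
    """Pick the candidate whose semantics score highest as the determined result."""
    return max(candidates, key=semantic_score)

determined = reconcile_by_semantics(
    ["turn on the living room light", "turn on the living groom light"]
)
print(determined)  # turn on the living room light
```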
As one embodiment of the reconciliation module 55 in the present invention, a search submodule performs a fuzzy search on the differing sections. The identical content immediately before and after a differing section is extracted as keywords. The keywords are searched, and, taking the search results with the highest count as the standard, the differing section is fuzzily replaced, so that the recognition result best matching semantic habits is taken as the determined recognition result.
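As an illustrative sketch of this search submodule, the shared context around a differing section serves as keywords, and whichever candidate co-occurs with those keywords most often in a corpus replaces the section. The in-memory corpus here is a toy assumption; a real system would issue the keywords to a cloud search service:

```python
# Toy corpus standing in for the fuzzy-search backend.
CORPUS = [
    "please turn on the living room light",
    "turn off the living room light",
    "the living room lamp is broken",
]

def fuzzy_resolve(prefix: str, candidates: list[str], suffix: str) -> str:
    """Return the candidate with the most corpus hits for prefix+candidate+suffix."""
    def hits(candidate: str) -> int:
        # Build the search phrase from the shared context and the candidate.
        phrase = " ".join(p for p in (prefix, candidate, suffix) if p)
        return sum(phrase in line for line in CORPUS)
    return max(candidates, key=hits)

# Differing section 'light' vs 'lamp', with shared context before it:
best = fuzzy_resolve("living room", ["light", "lamp"], "")
print(best)  # light
```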
As another embodiment of the reconciliation module 55 in the present invention, at least one second speech recognition submodule is arranged on the speech recognition module 53 to recognize the user speech again and obtain multiple second recognition results. Without changing the semantic base, arranging the second speech recognition submodule on the same speech recognition platform improves recognition precision and reduces the physical effects introduced by the speech recognition module. The multiple recognition results are compared with the multiple second recognition results; among them, the recognition result with the fewest differing sections, that is, the recognition result with the highest agreement rate, is obtained by comparison. Semantic analysis is then performed on this highest-agreement recognition result again to obtain the determined recognition result.
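A hedged sketch of this second-pass embodiment: re-recognize the speech to obtain second results, then keep the candidate that agrees with the most results across both passes. Counting exact-match agreement is our simplification of "fewest differing sections"; the function name is illustrative:

```python
from collections import Counter

def highest_agreement(first_results: list[str], second_results: list[str]) -> str:
    """Return the result occurring most often across both recognition passes."""
    counts = Counter(first_results + second_results)
    result, _ = counts.most_common(1)[0]
    return result

determined = highest_agreement(
    ["play some music", "play sum music"],
    ["play some music", "play some music"],
)
print(determined)  # play some music
```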
Of course, the three embodiments of the reconciliation module described above may coexist.
An intelligent sound box is also provided. The intelligent sound box comprises a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the aforementioned speech recognition method based on multi-source recognition.
A computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the aforementioned speech recognition method based on multi-source recognition.
The above embodiments merely illustrate specific implementations of the present invention. It should be noted that a person of ordinary skill in the art may make several variations and modifications without departing from the inventive concept, and all such variations and modifications shall fall within the protection scope of the present invention.

Claims (10)

1. A speech recognition method based on multi-source recognition, for use in an intelligent sound box, characterized in that the method comprises:
acquiring user speech through the intelligent sound box;
recognizing, by the intelligent sound box, the acquired user speech through at least two speech recognition platforms, and obtaining at least two recognition results;
obtaining, by the intelligent sound box, the at least two recognition results, and comparing the at least two recognition results recognized by the at least two speech recognition platforms;
outputting, by the intelligent sound box, the at least two recognition results when they are identical;
reconciling, by the intelligent sound box, the at least two recognition results when they differ, and then outputting.
2. The speech recognition method based on multi-source recognition according to claim 1, characterized in that before the intelligent sound box recognizes the acquired user speech through the at least two speech recognition platforms and obtains the at least two recognition results, the method further comprises:
providing the intelligent sound box with speech recognition platforms having at least two different recognition strategies as the at least two speech recognition platforms;
acquiring and storing the voiceprint of the user through the intelligent sound box;
denoising the acquired user speech.
3. The speech recognition method based on multi-source recognition according to claim 1, characterized in that the reconciling, by the intelligent sound box, of the at least two recognition results that differ and then outputting comprises:
distinguishing, in the intelligent sound box, the differing sections, and applying context semantic analysis to the differing sections;
calling a convolutional neural network model trained in the cloud to compute the semantics of the at least two recognition results, and determining one of them to output as the recognition result.
4. The speech recognition method based on multi-source recognition according to claim 1, characterized in that the reconciling, by the intelligent sound box, of the at least two recognition results that differ and then outputting comprises:
selecting at least one second speech recognition engine corresponding to the at least two speech recognition platforms to recognize the user speech again, and obtaining multiple second recognition results;
comparing the multiple recognition results with the multiple second recognition results;
selecting the recognition result with the highest agreement rate for output.
5. The speech recognition method based on multi-source recognition according to claim 1, characterized in that the reconciling, by the intelligent sound box, of the at least two recognition results that differ and then outputting comprises:
distinguishing the differing sections, and performing a fuzzy search on the differing sections;
selecting the recognition result with the highest fuzzy-search match for output.
6. A speech recognition system based on multi-source recognition, characterized in that the system comprises:
an input module, arranged in an intelligent sound box, for acquiring user speech;
at least two speech recognition modules, arranged in the intelligent sound box, for recognizing the user speech and obtaining at least two recognition results;
a comparison module, arranged in the intelligent sound box, for comparing the at least two recognition results recognized by the at least two speech recognition modules;
a reconciliation module, arranged in the intelligent sound box, for reconciling the at least two recognition results when they differ;
an output module, arranged in the intelligent sound box, for outputting the reconciled at least two recognition results.
7. The speech recognition system based on multi-source recognition according to claim 6, characterized in that the at least two speech recognition modules are speech recognition modules with at least two different recognition strategies, and the speech recognition module comprises:
a storage submodule, for storing the collected voiceprint of the user;
a denoising submodule, for denoising the acquired user speech.
8. The speech recognition system based on multi-source recognition according to claim 6 or 7, characterized in that the reconciliation module comprises:
a cloud computing submodule, for analyzing the context semantics of the differing sections and calling a convolutional neural network model trained in the cloud to compute the semantics of the at least two recognition results;
a search submodule, for performing a fuzzy search on the differing sections;
at least one second speech recognition submodule arranged on the speech recognition module, for recognizing the user speech again and obtaining multiple second recognition results.
9. An intelligent sound box, characterized in that the intelligent sound box comprises a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the speech recognition method based on multi-source recognition according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the speech recognition method based on multi-source recognition according to any one of claims 1 to 5.
CN201810673599.3A 2018-06-25 2018-06-25 Audio recognition method, system, speaker and storage medium based on multi-source identification Pending CN109119073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810673599.3A CN109119073A (en) 2018-06-25 2018-06-25 Audio recognition method, system, speaker and storage medium based on multi-source identification

Publications (1)

Publication Number Publication Date
CN109119073A true CN109119073A (en) 2019-01-01

Family

ID=64822455

Country Status (1)

Country Link
CN (1) CN109119073A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767758A (en) * 2019-01-11 2019-05-17 中山大学 Vehicle-mounted voice analysis method, system, storage medium and equipment
CN110634481A (en) * 2019-08-06 2019-12-31 惠州市德赛西威汽车电子股份有限公司 Voice integration method for outputting optimal recognition result
CN110853635A (en) * 2019-10-14 2020-02-28 广东美的白色家电技术创新中心有限公司 Speech recognition method, audio annotation method, computer equipment and storage device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1811915A (en) * 2005-01-28 2006-08-02 中国科学院计算技术研究所 Estimating and detecting method and system for telephone continuous speech recognition system performance
CN101807399A (en) * 2010-02-02 2010-08-18 华为终端有限公司 Voice recognition method and device
CN105810188A (en) * 2014-12-30 2016-07-27 联想(北京)有限公司 Information processing method and electronic equipment
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190101