CN109119073A - Audio recognition method, system, speaker and storage medium based on multi-source identification - Google Patents


Info

Publication number
CN109119073A
CN109119073A (application number CN201810673599.3A)
Authority
CN
China
Prior art keywords
recognition
speech
sound box
intelligent sound
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810673599.3A
Other languages
Chinese (zh)
Inventor
蔡洁荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FLYBALL ELECTRONIC (SHENZHEN) Co Ltd
Original Assignee
FLYBALL ELECTRONIC (SHENZHEN) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FLYBALL ELECTRONIC (SHENZHEN) Co Ltd filed Critical FLYBALL ELECTRONIC (SHENZHEN) Co Ltd
Priority to CN201810673599.3A priority Critical patent/CN109119073A/en
Publication of CN109119073A publication Critical patent/CN109119073A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics

Abstract

The invention discloses an audio recognition method, system, speaker, and storage medium based on multi-source identification. The method comprises: acquiring user speech through a smart speaker; the smart speaker submitting the acquired user speech to at least two speech recognition platforms for recognition, obtaining at least two recognition results; the smart speaker obtaining the at least two recognition results and comparing the results produced by the at least two speech recognition platforms; the smart speaker outputting the at least two recognition results when they are identical; and the smart speaker reconciling the at least two recognition results when they differ, then outputting the reconciled result. By providing at least two speech recognition platforms in the smart speaker to recognize user speech, outputting directly when the recognition results agree, and reconciling differing results into a final recognition result before output, the present invention greatly improves the speech recognition accuracy of the smart speaker.

Description

Audio recognition method, system, speaker and storage medium based on multi-source identification
Technical field
The present invention relates to the field of speech recognition, and in particular to an audio recognition method, system, speaker, and storage medium based on multi-source identification.
Background technique
Speech recognition is a key technology for human-computer interaction that allows a machine to recognize spoken user commands. It can markedly improve interaction, letting a user complete more tasks simply by speaking a command. Speech recognition is realized through a speech recognition engine obtained by online or offline training. The speech recognition process can generally be divided into a training stage and a recognition stage. In the training stage, an acoustic model (AM) and a lexicon are statistically derived from training data according to the mathematical model on which the speech recognition engine is based. In the recognition stage, the engine processes the input speech using the acoustic model and lexicon to obtain a speech recognition result. For example, feature extraction is performed on the spectrogram of the input sound to obtain feature vectors; a phoneme sequence (such as [i], [o], etc.) is then obtained from the acoustic model; finally, the words, or even sentences, that best match the phoneme sequence are located in the lexicon.
A speech recognition system may load more than one speech recognition engine to recognize the same speech simultaneously. For example, a first engine may be a speaker-dependent automatic speech recognition (SD-ASR) engine, trained to recognize speech from a specific speaker and to output a recognition result with a corresponding score. A second engine may be a speaker-independent automatic speech recognition (SI-ASR) engine, able to recognize speech from any user and likewise output a recognition result with a corresponding score.
Beyond human-computer interaction, speech recognition is also applied in social software, where user speech is converted to text for output. Whether for human-computer interaction or social applications, improving the accuracy of speech recognition remains a challenge.
Summary of the invention
The purpose of the present invention is to address the above drawbacks of the prior art by providing an audio recognition method, system, speaker, and storage medium based on multi-source identification.
The technical solution adopted by the present invention is to provide an audio recognition method based on multi-source identification, the method comprising:
acquiring user speech through a smart speaker;
the smart speaker submitting the acquired user speech to at least two speech recognition platforms for recognition, obtaining at least two recognition results;
the smart speaker obtaining the at least two recognition results and comparing the at least two recognition results produced by the at least two speech recognition platforms;
the smart speaker outputting the at least two recognition results when they are identical;
the smart speaker reconciling the at least two recognition results when they differ, then outputting the reconciled result.
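As an illustrative sketch (not part of the patent text), the claimed steps can be expressed in Python; the platform objects below are hypothetical callables standing in for real recognition services:

```python
def recognize_multi_source(audio, platforms, reconcile):
    """Run the audio through at least two recognition platforms.
    Output directly when all results agree; otherwise reconcile first."""
    assert len(platforms) >= 2, "the method requires at least two platforms"
    results = [p(audio) for p in platforms]   # one result per platform
    if all(r == results[0] for r in results):
        return results[0]                     # identical results: output as-is
    return reconcile(results)                 # differing results: reconcile
```

The `reconcile` callable corresponds to any of the reconciliation embodiments described later (semantic model, second engine, or fuzzy search).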
Preferably, before the smart speaker submits the acquired user speech to the at least two speech recognition platforms to obtain the at least two recognition results, the method further comprises:
providing in the smart speaker speech recognition platforms with at least two different recognition strategies as the at least two speech recognition platforms;
collecting and storing the voiceprint of the user through the smart speaker;
denoising the acquired user speech.
Recognizing the user speech with at least two speech recognition platforms improves speech recognition accuracy, and selecting platforms with at least two different recognition strategies as the at least two platforms that recognize the user speech makes the resulting accuracy, under different recognition strategies, more reliable. Collecting and storing the user's voiceprint and using it as a recognition sample yields higher recognition accuracy. Denoising the user speech makes the sound source easier to recognize and likewise improves accuracy.
Preferably, the smart speaker reconciling the differing at least two recognition results and then outputting comprises:
the smart speaker distinguishing the differing segment and applying context semantic analysis to the differing segment;
calling a cloud-computed convolutional neural training model to compute the semantics of the at least two recognition results, and determining one of them to output as the recognition result.
The at least two recognition results obtained by recognizing the user speech through the at least two speech recognition platforms are not necessarily identical; when the at least two recognition results differ, it cannot be determined which result to output. Calling the convolutional neural training model in the cloud to compute the semantics of the at least two recognition results yields, for output, the recognition result that best fits the semantic conventions of the semantic library. Because the result selected by the model computation fits those semantic conventions, recognition accuracy is improved.
Preferably, the smart speaker reconciling the differing at least two recognition results and then outputting comprises:
selecting at least one second speech recognition engine corresponding to the at least two speech recognition platforms to recognize the user speech again, obtaining multiple second recognition results;
comparing the multiple recognition results with the multiple second recognition results;
selecting the recognition result with the highest agreement rate for output.
For differing recognition results, recognizing again with a second speech engine increases the number of recognitions and improves recognition accuracy.
Preferably, the smart speaker reconciling the differing at least two recognition results and then outputting comprises:
distinguishing the differing segment and performing a fuzzy search on the differing segment;
selecting the recognition result with the highest fuzzy-search match for output.
By fuzzy-searching the differing segment and replacing it with the best-matching content found, content that fits semantic conventions, recognition accuracy is likewise improved.
A speech recognition system based on multi-source identification is also provided, the system comprising:
an input module, arranged in the smart speaker, for acquiring user speech;
at least two speech recognition modules, arranged in the smart speaker, for recognizing the user speech to obtain at least two recognition results;
a comparison module, arranged in the smart speaker, for comparing the at least two recognition results produced by the at least two speech recognition modules;
a reconciliation module, arranged in the smart speaker, for reconciling the at least two recognition results when they differ;
an output module, arranged in the smart speaker, for outputting the reconciled at least two recognition results.
Preferably, the at least two speech recognition modules are speech recognition modules with at least two different recognition strategies, each speech recognition module comprising:
a storage submodule, for storing the collected voiceprint of the user;
a denoising submodule, for denoising the acquired user speech.
Preferably, the reconciliation module comprises:
a cloud computing submodule, for analyzing the context semantics of the differing segment and calling the cloud convolutional neural training model to compute the semantics of the at least two recognition results;
a search submodule, for fuzzy-searching the differing segment;
at least one second speech recognition submodule, arranged on the speech recognition module, for recognizing the user speech again to obtain multiple second recognition results.
A smart speaker is also provided. The smart speaker comprises a processor and a memory in which at least one instruction, at least one program, code set, or instruction set is stored; the at least one instruction, at least one program, code set, or instruction set is loaded and executed by the processor to implement the aforementioned audio recognition method based on multi-source identification.
A computer-readable storage medium is also provided. At least one instruction, at least one program, code set, or instruction set is stored in the storage medium; the at least one instruction, at least one program, code set, or instruction set is loaded and executed by a processor to implement the aforementioned audio recognition method based on multi-source identification.
Compared with the prior art, the present invention at least has the following benefits: by providing at least two speech recognition platforms in the smart speaker to recognize user speech, outputting when the recognition results are identical, and reconciling differing results into a final recognition result before output, the present invention greatly improves the speech recognition accuracy of the smart speaker.
Detailed description of the invention
Fig. 1 is a flowchart of the audio recognition method based on multi-source identification according to the embodiment of the present invention;
Fig. 2 is a flowchart of one reconciliation process according to the embodiment of the present invention;
Fig. 3 is a flowchart of another reconciliation process according to the embodiment of the present invention;
Fig. 4 is a flowchart of yet another reconciliation process according to the embodiment of the present invention;
Fig. 5 is a module diagram of the speech recognition system based on multi-source identification according to the embodiment of the present invention.
Specific embodiment
The present invention will be further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the invention proposes an audio recognition method based on multi-source identification, implemented in a speech recognition environment that includes a terminal. The terminal may be a smart speaker, smartphone, tablet computer, laptop, desktop computer, or the like; the present invention places no specific restriction on the product type of the terminal. The terminal is equipped with a social or human-computer interaction application, and that application can call the terminal's built-in microphone and display device.
In the embodiments of the present invention, the environment is preferably a smart speaker. The smart speaker is provided with a microphone capable of capturing speech and a display screen, and is further provided with a speech recognition function module. Of course, as one possible implementation environment, the smart speaker may also connect over the network to voice platforms that provide speech recognition services.
The described method includes:
11. Acquiring user speech. Specifically, user speech is acquired through a sound pickup device arranged in the smart speaker; the sound pickup device may be a microphone or other equipment capable of sound collection.
Further, in order to obtain a better sound source, a denoising device may be arranged in the sound pickup device to denoise at the source and improve source quality, thereby reducing factors that interfere with speech recognition. The user speech may be converted into an audio signal by a voice converter.
Further, the audio signal may be converted into a digital signal and output to the speech recognition platforms.
12. Multi-platform speech recognition. The smart speaker submits the acquired user speech to at least two speech recognition platforms for recognition, obtaining at least two recognition results. In this step, the acquired user speech is sent to different speech recognition platforms for recognition, the speech recognition results of the multiple platforms are obtained, and those results are then compared and judged.
Specifically, multi-platform speech recognition may be realized within the smart speaker: multiple groups of speech recognition engines may be built into the smart speaker, each configured with its own semantic library. It should be noted that among the multiple engine groups built into the smart speaker, each group uses a different recognition strategy, and the semantic combination strategies of the configured semantic libraries also differ. For example, the engines may differ in the feature extraction module: speech feature vectors may be extracted using Mel-frequency cepstral coefficients or using perceptual linear prediction coefficients. They may also differ in how the acoustic model is built, for example using a hidden Markov model with Gaussian mixture models, a convolutional neural network, or a deep neural network. As for the semantic combination strategies of the semantic libraries, the difference may lie in the grammatical emphasis: some semantic libraries emphasize verb tense, some emphasize distinguishing near-synonyms or homophones, and some emphasize the completeness of the syntactic structure.
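As an illustrative sketch (not part of the patent text), the requirement that each engine group use a distinct recognition strategy could be modeled as a small registry; the strategy names below are hypothetical stand-ins for real engine configurations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineStrategy:
    """One engine group's recognition strategy (names are illustrative)."""
    feature_extractor: str   # e.g. "mfcc" or "plp"
    acoustic_model: str      # e.g. "hmm-gmm", "cnn", "dnn"
    semantic_emphasis: str   # e.g. "verb-tense", "homophones", "syntax"

# At least two engine groups, each with a different strategy, as the method requires.
ENGINE_GROUPS = [
    EngineStrategy("mfcc", "hmm-gmm", "verb-tense"),
    EngineStrategy("plp", "cnn", "homophones"),
    EngineStrategy("mfcc", "dnn", "syntax"),
]

def strategies_are_distinct(groups):
    """Verify that no two engine groups share the same full strategy."""
    return len(set(groups)) == len(groups)
```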
As one possible embodiment, among the multiple engine groups built into the smart speaker and their configured semantic libraries, a speech recognition engine may use multiple semantic libraries to obtain speech recognition results; that is, one group of speech recognition engines may obtain multiple speech recognition results through multiple semantic libraries.
Of course, as another possible embodiment, among the multiple engine groups built into the smart speaker and their configured semantic libraries, multiple engine groups may also share one semantic library to obtain multiple speech recognition results.
In addition, as yet another possible embodiment, among the multiple engine groups built into the smart speaker and their configured semantic libraries, each time speech is recognized, one or more semantic libraries may be randomly matched to an engine group to obtain the speech recognition results.
The speech recognition results may be output as digital signals for ease of comparison.
13. Checking whether the speech recognition results are identical. The smart speaker obtains the at least two recognition results and compares the at least two recognition results produced by the at least two speech recognition platforms. If the speech recognition results of the multiple platforms are identical, the speech recognition is accurate, and the method proceeds to step 15 for output. Each speech recognition platform comprises a speech recognition engine and a semantic library; the speech recognition engine is composed of a voice recognition chip and its circuitry, and matches the speech signal against the semantics in the semantic library to obtain the semantics that match the speech signal.
14. Reconciling the differing at least two recognition results before output. If the speech recognition results of the multiple platforms differ, the cause may be interference at the sound source that prevents the platforms from recognizing accurately, or it may be the platforms' differing recognition strategies. In this case, the differing recognition results must be reconciled to determine the final recognition result before output.
15. Outputting the identical at least two recognition results. A digital-to-analog signal converter is further provided in the terminal; the digital signals recognized by the speech recognition platforms are converted to analog signals for output, making the recognition result better fit people's reading habits.
In the embodiments of the present invention, before the user speech is recognized by the at least two speech recognition platforms to obtain the at least two recognition results, the method further comprises:
selecting speech recognition platforms with at least two different recognition strategies as the at least two speech recognition platforms. Because the speech recognition results are obtained under different recognition strategies, with different speech recognition engines and differently configured semantic libraries, the accuracy of the results is assured, effectively improving the accuracy of the final speech recognition result.
Collecting and storing the voiceprint of the user. A voiceprint is the sound-wave spectrum, displayed by electroacoustic instruments, that carries speech information. A voiceprint is not only specific but also relatively stable: after adulthood, a person's voice remains relatively stable for a long time. Whether the speaker deliberately imitates another person's voice and tone or whispers softly, even if the imitation is vivid and lifelike, the voiceprint remains distinct. Voiceprint characteristics allow the speech recognition platforms to capture the user's voice band more easily, improving the accuracy of speech recognition.
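As an illustrative sketch, matching incoming speech against a stored voiceprint is often done by comparing feature vectors; the cosine-similarity check and the threshold below are assumptions for illustration, not the patent's specified method:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matches_stored_voiceprint(stored, candidate, threshold=0.8):
    """Accept the candidate only if it is close enough to the enrolled voiceprint."""
    return cosine_similarity(stored, candidate) >= threshold
```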
Denoising the acquired user speech. The user speech collected from the terminal's built-in microphone, as the speech to be recognized, can pick up noise from the external environment during acquisition. This noise interferes with the speech to be recognized, affecting the accuracy of speech recognition and degrading it.
Specifically, the speech to be recognized may be denoised to obtain a better sound source. Ambient sound may be collected in advance and converted into a digital audio signal by an analog-to-digital converter; this signal may serve as a reference signal, and when the speech to be recognized is denoised, this component is eliminated.
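A minimal sketch of the reference-signal idea described above: the pre-captured ambient signal is subtracted from the speech signal sample by sample. This is only an illustration; real devices would typically use adaptive filtering:

```python
def denoise_with_reference(speech, ambient_reference):
    """Subtract a pre-captured ambient reference signal from the speech
    signal, sample by sample, to eliminate the noise component."""
    cleaned = list(speech)
    for i, r in enumerate(ambient_reference[:len(cleaned)]):
        cleaned[i] -= r
    return cleaned
```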
Recognizing the user speech with at least two speech recognition platforms improves speech recognition accuracy, and selecting platforms with at least two different recognition strategies as the at least two platforms that recognize the user speech makes the resulting accuracy, under different recognition strategies, more reliable. Collecting and storing the user's voiceprint and using it as a recognition sample yields higher recognition accuracy. Denoising the user speech makes the sound source easier to recognize and likewise improves accuracy.
As shown in Fig. 2, in the embodiments of the present invention, reconciling the differing at least two recognition results before output comprises the steps of:
21. Distinguishing the differing segment and applying context semantic analysis to the differing segment. Setting the differing segment aside, the identical portions before and after the differing segment of the recognition results are matched in the semantic library to obtain similar semantics.
22. Model training. The cloud convolutional neural training model is called to compute the semantics of the at least two recognition results.
Before the cloud convolutional neural training model is called for computation, the model needs to be trained so that the convolutional neural network can rapidly compute a predicted value for the semantics of the recognition result. The predicted value replaces the differing segment in the recognition results and, together with the identical parts of the recognition results, forms the determined recognition result.
23. Determining the recognition result: one of the results is determined as the recognition result and output. After conversion by the digital-to-analog converter, the determined recognition result may be output to the display device of the terminal; of course, it may also be output directly, without conversion to an analog signal, to form an instruction to the terminal.
The at least two recognition results obtained by recognizing the user speech through the at least two speech recognition platforms are not necessarily identical; when the at least two recognition results differ, it cannot be determined which result to output. Calling the convolutional neural training model in the cloud to compute the semantics of the at least two recognition results yields, for output, the recognition result that best fits the semantic conventions of the semantic library. Because the result selected by the model computation fits those semantic conventions, recognition accuracy is improved.
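The replacement of the differing segment can be sketched as follows; the `predict` callable stands in for the trained convolutional model, which is not reproduced here:

```python
def split_difference(a, b):
    """Split two token sequences into (common prefix, the two differing
    middles, common suffix)."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0
    while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[:i], (a[i:len(a) - j], b[i:len(b) - j]), a[len(a) - j:]

def reconcile(a, b, predict):
    """Replace the differing segment with the model's predicted tokens,
    keeping the identical parts of the two recognition results."""
    prefix, middles, suffix = split_difference(a, b)
    return prefix + predict(prefix, middles, suffix) + suffix
```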
As shown in Fig. 3, as one embodiment of the invention, reconciling the differing at least two recognition results before output comprises the following steps:
31. Second speech recognition engine. At least one second speech recognition engine corresponding to the at least two speech recognition platforms is selected to recognize the user speech again, obtaining multiple second recognition results. Without changing the semantic library, placing the second speech recognition engine on the same speech recognition platform can improve speech recognition accuracy and reduce the physical effects introduced by the speech recognition engine.
32. Comparing recognition results. The multiple recognition results are compared with the multiple second recognition results. Among the multiple results, the result with the fewest differing segments, that is, the result with the highest agreement rate, is found by comparison; semantic analysis is then applied to the result with the highest agreement rate to obtain the determined recognition result.
33. Determining the recognition result: the result with the highest agreement rate is selected for output. After conversion by the digital-to-analog converter, the determined recognition result may be output to the display device of the terminal; of course, it may also be output directly, without conversion to an analog signal, to form an instruction to the terminal.
For differing recognition results, recognizing again with the second speech engine increases the number of recognitions and improves recognition accuracy.
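The agreement-rate selection of steps 31 to 33 can be sketched as a simple pooled vote; this is an illustration under the assumption that results are compared as whole strings:

```python
from collections import Counter

def pick_by_agreement(first_results, second_results):
    """Pool the first- and second-pass recognition results and return the
    hypothesis with the highest agreement rate (produced most often)."""
    counts = Counter(first_results + second_results)
    return counts.most_common(1)[0][0]
```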
As shown in Fig. 4, as another embodiment of the invention, reconciling the differing at least two recognition results before output comprises:
41. Distinguishing the differing segment: the identical semantics before and after the differing segment are taken out as keywords.
42. Fuzzy search: the differing segment is fuzzy-searched and replaced. Searching by keyword, and taking the candidate that returns the most search results as the standard, the differing segment is fuzzily replaced, and the recognition result that best fits semantic conventions is obtained as the determined recognition result.
43. Determining the recognition result: the result with the highest fuzzy-search match is selected for output. After conversion by the digital-to-analog converter, the determined recognition result may be output to the display device of the terminal; of course, it may also be output directly, without conversion to an analog signal, to form an instruction to the terminal.
By fuzzy-searching the differing segment and replacing it with the best-matching content found, content that fits semantic conventions, the accuracy of speech recognition is likewise improved.
As one possible embodiment, in order to obtain fuzzy-search results faster, a retrieval semantic library may be configured for the smart speaker, storing the semantic fields used for speech recognition.
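The fuzzy-search replacement of steps 41 to 43 can be sketched as follows; `search_count` is a stand-in for a real search backend, which the patent does not specify:

```python
def fuzzy_replace(prefix, candidates, suffix, search_count):
    """For each candidate filling the differing segment, query a search
    backend with the surrounding keywords and keep the candidate whose
    query returns the most results (an assumption of this sketch)."""
    def hits(candidate):
        return search_count(" ".join(prefix + [candidate] + suffix))
    best = max(candidates, key=hits)
    return prefix + [best] + suffix
```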
As shown in Fig. 5, a speech recognition system based on multi-source identification is also provided. The system is applied to a terminal, which may be a smartphone, tablet computer, laptop, desktop computer, or the like; the present invention places no specific restriction on the product type of the terminal. The terminal is equipped with a social or human-computer interaction application, and that application can call the terminal's built-in microphone and display device.
The system comprises:
An input module 51, for acquiring user speech; the input module 51 is the microphone built into the terminal.
At least two speech recognition modules 52, for recognizing the user speech to obtain at least two recognition results. The speech recognition module 53 is a voice recognition chip arranged in the application cloud; the terminal is provided with an analog-to-digital converter that converts the user speech into an audio signal.
The audio signal may also be converted into a communication signal and uploaded to the cloud for recognition.
Of course, the speech recognition module 53 may also be a voice recognition chip arranged in the terminal.
A comparison module 54, for comparing the at least two recognition results obtained by the at least two speech recognition modules 52; the comparison module 54 is a processing chip for processing data.
A reconciliation module 55, for reconciling the differing at least two recognition results; the reconciliation module 55 reconciles the recognition results at the digital level.
An output module 56, for outputting the reconciled at least two recognition results. The output module 56 is arranged in the terminal, and a digital-to-analog signal converter is further provided in the terminal; the digital signals recognized by the speech recognition platforms are converted to analog signals for output, making the recognition result better fit people's reading habits.
In the embodiments of the present invention, the at least two speech recognition modules 52 are speech recognition modules 53 with at least two different recognition strategies; the speech recognition module 53 comprises:
a storage submodule, for storing the collected voiceprint of the user; the storage submodule is arranged in the application cloud and stores the user's voiceprint in the cloud;
a denoising submodule, for denoising the acquired user speech. The denoising submodule may be arranged in the application cloud and digitally denoises the user speech.
In embodiments of the present invention, the same module 55 includes:
a cloud computing submodule, for analyzing the context semantics of the differing sections, which calls a convolutional neural network model trained in the cloud to compute the semantics of the at least two recognition results. Specifically, before the cloud convolutional neural network model is called, the model must be trained so that it can rapidly compute a predicted value for the semantics of a recognition result. This predicted value replaces the differing section of the recognition results and, together with the sections that are identical across the recognition results, composes the determined recognition result.
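A hedged sketch of this reconciliation flow follows. The trained cloud convolutional network is stood in for by a toy scoring function, so only the control flow (score each candidate's semantics, keep the best) mirrors the description; every name and the tiny vocabulary are illustrative assumptions:

```python
# Toy stand-in for the cloud model's semantic predicted value: score a
# candidate by the fraction of its words found in a small in-domain
# vocabulary. A real system would query the trained CNN in the cloud.
def semantic_score(sentence: str) -> float:
    vocabulary = {"turn", "on", "the", "living", "room", "light"}
    words = sentence.split()
    return sum(w in vocabulary for w in words) / len(words)

def reconcile_by_semantics(candidates: list[str]) -> str:
    """Pick the candidate whose semantics score highest as the determined result."""
    return max(candidates, key=semantic_score)

determined = reconcile_by_semantics(
    ["turn on the living room light", "turn on the living groom light"]
)
print(determined)  # turn on the living room light
```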
As one embodiment of the reconciliation module 55 in the present invention, a search submodule performs a fuzzy search on the differing sections. The identical content immediately before and after a differing section is extracted as keywords. The keywords are searched, and, taking the search results with the highest count as the standard, the differing section is fuzzily replaced, so that the recognition result best matching semantic habits is taken as the determined recognition result.
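As an illustrative sketch of this search submodule, the shared context around a differing section serves as keywords, and whichever candidate co-occurs with those keywords most often in a corpus replaces the section. The in-memory corpus here is a toy assumption; a real system would issue the keywords to a cloud search service:

```python
# Toy corpus standing in for the fuzzy-search backend.
CORPUS = [
    "please turn on the living room light",
    "turn off the living room light",
    "the living room lamp is broken",
]

def fuzzy_resolve(prefix: str, candidates: list[str], suffix: str) -> str:
    """Return the candidate with the most corpus hits for prefix+candidate+suffix."""
    def hits(candidate: str) -> int:
        # Build the search phrase from the shared context and the candidate.
        phrase = " ".join(p for p in (prefix, candidate, suffix) if p)
        return sum(phrase in line for line in CORPUS)
    return max(candidates, key=hits)

# Differing section 'light' vs 'lamp', with shared context before it:
best = fuzzy_resolve("living room", ["light", "lamp"], "")
print(best)  # light
```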
As another embodiment of the reconciliation module 55 in the present invention, at least one second speech recognition submodule is arranged on the speech recognition module 53 to recognize the user speech again and obtain multiple second recognition results. Without changing the semantic base, arranging the second speech recognition submodule on the same speech recognition platform improves recognition precision and reduces the physical effects introduced by the speech recognition module. The multiple recognition results are compared with the multiple second recognition results; among them, the recognition result with the fewest differing sections, that is, the recognition result with the highest agreement rate, is obtained by comparison. Semantic analysis is then performed on this highest-agreement recognition result again to obtain the determined recognition result.
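A hedged sketch of this second-pass embodiment: re-recognize the speech to obtain second results, then keep the candidate that agrees with the most results across both passes. Counting exact-match agreement is our simplification of "fewest differing sections"; the function name is illustrative:

```python
from collections import Counter

def highest_agreement(first_results: list[str], second_results: list[str]) -> str:
    """Return the result occurring most often across both recognition passes."""
    counts = Counter(first_results + second_results)
    result, _ = counts.most_common(1)[0]
    return result

determined = highest_agreement(
    ["play some music", "play sum music"],
    ["play some music", "play some music"],
)
print(determined)  # play some music
```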
Of course, the three embodiments of the reconciliation module described above may coexist.
An intelligent sound box is also provided. The intelligent sound box comprises a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the aforementioned speech recognition method based on multi-source recognition.
A computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the aforementioned speech recognition method based on multi-source recognition.
The above embodiments merely illustrate specific implementations of the present invention. It should be noted that a person of ordinary skill in the art may make several variations and modifications without departing from the inventive concept, and all such variations and modifications shall fall within the protection scope of the present invention.

Claims (10)

1. A speech recognition method based on multi-source recognition, for use in an intelligent sound box, characterized in that the method comprises:
acquiring user speech through the intelligent sound box;
recognizing, by the intelligent sound box, the acquired user speech through at least two speech recognition platforms, and obtaining at least two recognition results;
obtaining, by the intelligent sound box, the at least two recognition results, and comparing the at least two recognition results recognized by the at least two speech recognition platforms;
outputting, by the intelligent sound box, the at least two recognition results when they are identical;
reconciling, by the intelligent sound box, the at least two recognition results when they differ, and then outputting.
2. The speech recognition method based on multi-source recognition according to claim 1, characterized in that before the intelligent sound box recognizes the acquired user speech through the at least two speech recognition platforms and obtains the at least two recognition results, the method further comprises:
providing the intelligent sound box with speech recognition platforms having at least two different recognition strategies as the at least two speech recognition platforms;
acquiring and storing the voiceprint of the user through the intelligent sound box;
denoising the acquired user speech.
3. The speech recognition method based on multi-source recognition according to claim 1, characterized in that the reconciling, by the intelligent sound box, of the at least two recognition results that differ and then outputting comprises:
distinguishing, in the intelligent sound box, the differing sections, and applying context semantic analysis to the differing sections;
calling a convolutional neural network model trained in the cloud to compute the semantics of the at least two recognition results, and determining one of them to output as the recognition result.
4. The speech recognition method based on multi-source recognition according to claim 1, characterized in that the reconciling, by the intelligent sound box, of the at least two recognition results that differ and then outputting comprises:
selecting at least one second speech recognition engine corresponding to the at least two speech recognition platforms to recognize the user speech again, and obtaining multiple second recognition results;
comparing the multiple recognition results with the multiple second recognition results;
selecting the recognition result with the highest agreement rate for output.
5. The speech recognition method based on multi-source recognition according to claim 1, characterized in that the reconciling, by the intelligent sound box, of the at least two recognition results that differ and then outputting comprises:
distinguishing the differing sections, and performing a fuzzy search on the differing sections;
selecting the recognition result with the highest fuzzy-search match for output.
6. A speech recognition system based on multi-source recognition, characterized in that the system comprises:
an input module, arranged in an intelligent sound box, for acquiring user speech;
at least two speech recognition modules, arranged in the intelligent sound box, for recognizing the user speech and obtaining at least two recognition results;
a comparison module, arranged in the intelligent sound box, for comparing the at least two recognition results recognized by the at least two speech recognition modules;
a reconciliation module, arranged in the intelligent sound box, for reconciling the at least two recognition results when they differ;
an output module, arranged in the intelligent sound box, for outputting the reconciled at least two recognition results.
7. The speech recognition system based on multi-source recognition according to claim 6, characterized in that the at least two speech recognition modules are speech recognition modules with at least two different recognition strategies, and the speech recognition module comprises:
a storage submodule, for storing the collected voiceprint of the user;
a denoising submodule, for denoising the acquired user speech.
8. The speech recognition system based on multi-source recognition according to claim 6 or 7, characterized in that the reconciliation module comprises:
a cloud computing submodule, for analyzing the context semantics of the differing sections and calling a convolutional neural network model trained in the cloud to compute the semantics of the at least two recognition results;
a search submodule, for performing a fuzzy search on the differing sections;
at least one second speech recognition submodule arranged on the speech recognition module, for recognizing the user speech again and obtaining multiple second recognition results.
9. An intelligent sound box, characterized in that the intelligent sound box comprises a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the speech recognition method based on multi-source recognition according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the speech recognition method based on multi-source recognition according to any one of claims 1 to 5.
CN201810673599.3A 2018-06-25 2018-06-25 Audio recognition method, system, speaker and storage medium based on multi-source identification Pending CN109119073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810673599.3A CN109119073A (en) 2018-06-25 2018-06-25 Audio recognition method, system, speaker and storage medium based on multi-source identification

Publications (1)

Publication Number Publication Date
CN109119073A true CN109119073A (en) 2019-01-01

Family

ID=64822455

Country Status (1)

Country Link
CN (1) CN109119073A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767758A (en) * 2019-01-11 2019-05-17 中山大学 Vehicle-mounted voice analysis method, system, storage medium and equipment
CN110634481A (en) * 2019-08-06 2019-12-31 惠州市德赛西威汽车电子股份有限公司 Voice integration method for outputting optimal recognition result
CN110853635A (en) * 2019-10-14 2020-02-28 广东美的白色家电技术创新中心有限公司 Speech recognition method, audio annotation method, computer equipment and storage device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1811915A (en) * 2005-01-28 2006-08-02 中国科学院计算技术研究所 Estimating and detecting method and system for telephone continuous speech recognition system performance
CN101807399A (en) * 2010-02-02 2010-08-18 华为终端有限公司 Voice recognition method and device
CN105810188A (en) * 2014-12-30 2016-07-27 联想(北京)有限公司 Information processing method and electronic equipment
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190101