CN109817220A - Audio recognition method, apparatus and system - Google Patents

Audio recognition method, apparatus and system

Info

Publication number
CN109817220A
CN109817220A (application CN201711147698.XA)
Authority
CN
China
Prior art keywords
dialect
voice
word
wakes
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711147698.XA
Other languages
Chinese (zh)
Inventor
牛也
徐巍越
冯伟国
黄光远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201711147698.XA priority Critical patent/CN109817220A/en
Priority to TW107132609A priority patent/TW201923736A/en
Priority to PCT/CN2018/114531 priority patent/WO2019096056A1/en
Publication of CN109817220A publication Critical patent/CN109817220A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 — Speech-to-text systems
    • G10L 15/28 — Constructional details of speech recognition systems
    • G10L 15/30 — Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present application provide an audio recognition method, apparatus, and system. The method includes: receiving a voice wake-up word; identifying the first dialect to which the voice wake-up word belongs; sending a service request to a server to request that the server select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect. The method can automatically perform speech recognition on multiple dialects, improving the efficiency of multi-dialect speech recognition.

Description

Audio recognition method, apparatus and system
Technical field
This application relates to the technical field of speech recognition, and in particular to an audio recognition method, apparatus, and system.
Background art
Automatic Speech Recognition (ASR) is a technology that converts human speech audio signals into text content. With the development of software and hardware technology, the computing power and storage capacity of smart devices have improved greatly, so speech recognition technology is now widely applied in smart devices.
In speech recognition, phonemes must be accurately identified before they can be converted into text. However, for any language, various factors give rise to many different pronunciations of that language, i.e. multiple dialects. Taking Chinese as an example, there are dialect groups such as Mandarin, Jin, Xiang, Gan, Wu, Min, Yue (Cantonese), and Hakka, and the pronunciations of different dialects differ greatly.
At present, speech recognition schemes for dialects are still immature, and a solution is needed for the multi-dialect problem.
Summary of the invention
Various aspects of the present application provide an audio recognition method, apparatus, and system, so as to perform speech recognition on multiple dialects automatically and improve the efficiency of multi-dialect speech recognition.
An embodiment of the present application provides an audio recognition method applicable to a terminal device. The method comprises:
receiving a voice wake-up word;
identifying the first dialect to which the voice wake-up word belongs;
sending a service request to a server, so as to request the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect.
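The terminal-side steps above can be sketched as a minimal client flow. Everything here is illustrative: the function names (`identify_dialect`, `handle_service_request`, `recognize`) and the in-process "server" are hypothetical stand-ins, not the patent's implementation.

```python
# Hypothetical sketch of the terminal-side flow; names and transport are illustrative.

class InProcessServer:
    """Stand-in for the remote server: holds one ASR callable per dialect."""
    def __init__(self, models):
        self.models = models      # dialect name -> recognizer callable
        self.selected = None

    def handle_service_request(self, request):
        # Select the ASR model named by the service request.
        self.selected = self.models[request["dialect"]]

    def recognize(self, speech_signal):
        # Run recognition with the previously selected model.
        return self.selected(speech_signal)


def identify_dialect(wake_word_audio):
    """Placeholder for wake-word dialect identification (e.g. acoustic matching)."""
    return "yue"  # assume, for illustration, the wake-up word was Cantonese


def terminal_flow(wake_word_audio, speech_signal, server):
    dialect = identify_dialect(wake_word_audio)          # identify the first dialect
    server.handle_service_request({"dialect": dialect})  # request model selection
    return server.recognize(speech_signal)               # send speech to be recognized
```

In a real deployment the two calls on `server` would be network requests; the in-process object only makes the ordering of the claim's four steps concrete.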
An embodiment of the present application further provides an audio recognition method applicable to a server. The method comprises:
receiving a service request sent by a terminal device, the service request instructing selection of the ASR model corresponding to a first dialect;
selecting, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, where the first dialect is the dialect to which a voice wake-up word belongs; and
receiving the speech signal to be recognized sent by the terminal device, and performing speech recognition on it using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides an audio recognition method applicable to a terminal device. The method comprises:
receiving a voice wake-up word;
sending the voice wake-up word to a server, so that the server selects, based on the wake-up word, the ASR model corresponding to the first dialect to which the wake-up word belongs from among the ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides an audio recognition method applicable to a server. The method comprises:
receiving a voice wake-up word sent by a terminal device;
identifying the first dialect to which the voice wake-up word belongs;
selecting, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receiving the speech signal to be recognized sent by the terminal device, and performing speech recognition on it using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides an audio recognition method, comprising:
receiving a voice wake-up word;
identifying the first dialect to which the voice wake-up word belongs;
selecting, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
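The single-device variant above condenses the whole pipeline into one place. A minimal sketch follows; `identify` and the per-dialect recognizers are hypothetical placeholders injected as arguments, not components the patent specifies.

```python
# Illustrative end-to-end pipeline for the single-device variant; all components
# are stand-ins passed in as arguments, not the patent's implementation.

def recognize_with_dialect(wake_word, speech_signal, identify, models):
    """identify: wake_word -> dialect name; models: dialect -> recognizer callable."""
    first_dialect = identify(wake_word)   # identify the dialect of the wake-up word
    asr_model = models[first_dialect]     # select the matching ASR model
    return asr_model(speech_signal)       # recognize the speech signal with it
```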
An embodiment of the present application further provides an audio recognition method applicable to a terminal device. The method comprises:
receiving a voice wake-up word, so as to wake up the speech recognition function;
receiving a first voice signal with dialect-indicating meaning input by a user;
parsing, from the first voice signal, the first dialect in which speech recognition needs to be performed;
sending a service request to a server, so as to request the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a terminal device, comprising a memory, a processor, and a communication component.
The memory is configured to store a computer program.
The processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component;
identify the first dialect to which the voice wake-up word belongs;
send a service request to a server through the communication component, so as to request the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application further provides a server, comprising a memory, a processor, and a communication component.
The memory is configured to store a computer program.
The processor, coupled with the memory, is configured to execute the computer program so as to:
receive, through the communication component, a service request sent by a terminal device, the service request instructing selection of the ASR model corresponding to a first dialect;
select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, where the first dialect is the dialect to which a voice wake-up word belongs; and
receive, through the communication component, the speech signal to be recognized sent by the terminal device, and perform speech recognition on it using the ASR model corresponding to the first dialect.
The communication component is configured to receive the service request and the speech signal to be recognized.
An embodiment of the present application further provides a terminal device, comprising a memory, a processor, and a communication component.
The memory is configured to store a computer program.
The processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component;
send the voice wake-up word to a server through the communication component, so that the server selects, based on the wake-up word, the ASR model corresponding to the first dialect to which the wake-up word belongs from among the ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word, and to send the wake-up word and the speech signal to be recognized to the server.
An embodiment of the present application further provides a server, comprising a memory, a processor, and a communication component.
The memory is configured to store a computer program.
The processor, coupled with the memory, is configured to execute the computer program so as to:
receive, through the communication component, a voice wake-up word sent by a terminal device;
identify the first dialect to which the voice wake-up word belongs;
select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receive, through the communication component, the speech signal to be recognized sent by the terminal device, and perform speech recognition on it using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word and the speech signal to be recognized.
An embodiment of the present application further provides an electronic device, comprising a memory, a processor, and a communication component.
The memory is configured to store a computer program.
The processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component;
identify the first dialect to which the voice wake-up word belongs;
select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word.
An embodiment of the present application further provides a terminal device, comprising a memory, a processor, and a communication component.
The memory is configured to store a computer program.
The processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component, so as to wake up the speech recognition function;
receive, through the communication component, a first voice signal with dialect-indicating meaning input by a user;
parse, from the first voice signal, the first dialect in which speech recognition needs to be performed;
send a service request to a server through the communication component, so as to request the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the first audio recognition method embodiment described above.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the second audio recognition method embodiment described above.
An embodiment of the present application further provides a speech recognition system comprising a server and a terminal device.
The terminal device is configured to receive a voice wake-up word, identify the first dialect to which the wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, where the service request instructs selection of the ASR model corresponding to the first dialect.
The server is configured to receive the service request, select, according to the instruction of the service request, the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on it using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition system comprising a server and a terminal device.
The terminal device is configured to receive a voice wake-up word, send the wake-up word to the server, and send a speech signal to be recognized to the server.
The server is configured to receive the voice wake-up word, identify the first dialect to which it belongs, select the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on it using the ASR model corresponding to the first dialect.
In the embodiments of the present application, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, and the ASR model corresponding to that dialect is then selected from among the ASR models corresponding to different dialects and used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the wake-up word, with no manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic structural diagram of a speech recognition system provided by an exemplary embodiment of the present application;
Fig. 2 is a schematic flowchart of an audio recognition method provided by another exemplary embodiment of the present application;
Fig. 3 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the present application;
Fig. 4 is a schematic structural diagram of another speech recognition system provided by another exemplary embodiment of the present application;
Fig. 5 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the present application;
Fig. 6 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the present application;
Fig. 7 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the present application;
Fig. 8 is a schematic block diagram of a speech recognition apparatus provided by another exemplary embodiment of the present application;
Fig. 9 is a schematic structural diagram of a terminal device provided by another exemplary embodiment of the present application;
Fig. 10 is a schematic block diagram of another speech recognition apparatus provided by another exemplary embodiment of the present application;
Fig. 11 is a schematic structural diagram of a server provided by another exemplary embodiment of the present application;
Fig. 12 is a schematic block diagram of another speech recognition apparatus provided by another exemplary embodiment of the present application;
Fig. 13 is a schematic structural diagram of another terminal device provided by another exemplary embodiment of the present application;
Fig. 14 is a schematic block diagram of another speech recognition apparatus provided by another exemplary embodiment of the present application;
Fig. 15 is a schematic structural diagram of another server provided by another exemplary embodiment of the present application;
Fig. 16 is a schematic block diagram of another speech recognition apparatus provided by another exemplary embodiment of the present application;
Fig. 17 is a schematic structural diagram of an electronic device provided by another exemplary embodiment of the present application.
Detailed description of the embodiments
To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the prior art, speech recognition schemes for dialects are still immature. For this technical problem, the embodiments of the present application provide a solution whose main idea is: build ASR models for different dialects; during speech recognition, identify in advance the dialect to which the voice wake-up word belongs; then select the ASR model corresponding to that dialect from among the ASR models corresponding to different dialects, and use the selected model to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the wake-up word, with no manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a speech recognition system provided by an exemplary embodiment of the present application. As shown in Fig. 1, the speech recognition system includes a server 101 and a terminal device 102, which are communicatively connected.
For example, the terminal device 102 may be communicatively connected to the server 101 over the Internet, or over a mobile network. If the terminal device 102 is connected to the server 101 over a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, and the like.
The server 101 mainly provides ASR models for different dialects and selects the corresponding ASR model to perform speech recognition on speech signals in the corresponding dialect. The server 101 may be any device that can provide computing services and can respond to and process service requests, for example a general-purpose server, a cloud server, a cloud host, or a virtual center. The composition of the server mainly includes a processor, hard disk, memory, system bus, and so on, similar to a general computer architecture.
In this embodiment, the terminal device 102 mainly faces the user and can provide the user with an interface or entry point for speech recognition. The terminal device 102 can be implemented in many ways, for example as a smartphone, a smart speaker, a personal computer, a wearable device, or a tablet computer. The terminal device 102 generally includes at least one processing unit and at least one memory; the numbers of processing units and memories depend on the configuration and type of the terminal device 102. The memory may be volatile, such as RAM, or non-volatile, such as read-only memory (ROM) or flash memory, or may include both types. The memory typically stores an operating system (OS) and one or more application programs, and may also store program data and the like. In addition to the processing unit and memory, the terminal device 102 includes some basic components, such as a network interface chip, an IO bus, and audio/video components (e.g. a microphone). Optionally, the terminal device 102 may also include some peripheral devices, such as a keyboard, mouse, stylus, or printer. These peripheral devices are well known in the art and are not described here.
In this embodiment, the terminal device 102 and the server 101 cooperate to provide the user with a speech recognition function. Moreover, in some cases the terminal device 102 may be used by multiple users, and those users may speak different dialects. Taking Chinese as an example, by region it may include the following dialect groups: Mandarin, Jin, Xiang, Gan, Wu, Min, Yue, and Hakka. Some dialects can be further subdivided; for example, Min may include Northern Min, Southern Min, Eastern Min, Central Min, Puxian, and so on. The pronunciations of different dialects differ greatly and cannot be recognized with the same ASR model. Therefore, in this embodiment, a separate ASR model is built for each dialect so that speech recognition can be performed on different dialects. In turn, through the cooperation between the terminal device 102 and the server 101, the speech recognition function can be provided to users who speak different dialects, i.e. the speech signals of users speaking different dialects can be recognized.
To improve speech recognition efficiency, the terminal device 102 supports a voice wake-up word function: when the user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device 102 to wake up the speech recognition function. The voice wake-up word is a voice signal with specified text content, for example "open", "Tmall Genie", or "hello". The terminal device 102 receives the voice wake-up word input by the user and identifies the dialect to which it belongs; it can thereby determine the dialect of the subsequent speech signal to be recognized (i.e. the dialect of the wake-up word), providing a basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect. The first dialect may be any dialect of any language.
After identifying the first dialect to which the voice wake-up word belongs, the terminal device 102 may send a service request to the server 101, instructing the server 101 to select the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects. The server 101 receives the service request sent by the terminal device 102 and then, according to its instruction, selects the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, so as to perform speech recognition on the subsequent speech signal to be recognized based on that model. In this embodiment, the server 101 stores the ASR models corresponding to different dialects in advance. An ASR model is a model that can convert a speech signal into text. Optionally, one dialect corresponds to one ASR model, or several similar dialects may correspond to the same ASR model; this is not limited. The ASR model corresponding to the first dialect is used to convert speech signals in the first dialect into text content.
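The description allows several similar dialects to share one ASR model. A minimal sketch of such a registry lookup follows; the dialect keys and model names are hypothetical, chosen only to illustrate the many-to-one mapping.

```python
# Hypothetical dialect-to-model registry: similar dialects may share one model,
# as the description permits. Keys and model names are illustrative only.
ASR_MODELS = {
    "minnan": "min-asr-v1",   # Southern Min
    "minbei": "min-asr-v1",   # Northern Min shares the Min model (assumption)
    "yue":    "yue-asr-v1",   # Cantonese has its own model
}

def select_asr_model(first_dialect):
    """Return the ASR model registered for the first dialect, or raise if none."""
    try:
        return ASR_MODELS[first_dialect]
    except KeyError:
        raise ValueError(f"no ASR model registered for dialect {first_dialect!r}")
```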
After sending the service request to the server 101, the terminal device 102 continues by sending the speech signal to be recognized, which belongs to the first dialect, to the server 101. The server 101 receives this speech signal and performs speech recognition on it with the selected ASR model corresponding to the first dialect. This not only enables speech recognition of the first dialect but, because a matching ASR model is used, also helps improve the accuracy of speech recognition.
Optionally, the speech signal to be recognized may be a voice signal that the user continues to input to the terminal device 102 after inputting the wake-up word; in this case, before sending it to the server 101, the terminal device 102 may also receive the speech signal to be recognized input by the user. Alternatively, the speech signal to be recognized may be prerecorded and stored locally on the terminal device 102; in this case, the terminal device 102 can obtain it directly from local storage.
In some exemplary embodiments, the server 101 may return the speech recognition result, or information related to it, to the terminal device 102. For example, the server 101 may return the recognized text content to the terminal device 102; alternatively, the server 101 may return information such as songs or videos that match the recognition result. The terminal device 102 receives the recognition result or the related information and performs subsequent processing accordingly. For example, after receiving the recognized text content, the terminal device 102 may display it to the user or perform a web search based on it. As another example, after receiving related information such as songs or videos, the terminal device 102 may play it, or forward it to other users to share the information.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, and the ASR model corresponding to that dialect is selected from among the ASR models corresponding to different dialects and used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the wake-up word, with no manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is relatively short, identifying its dialect takes little time. The speech recognition system can therefore quickly identify the first dialect to which the wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
The embodiments of the present application do not limit the manner in which terminal device 102 identifies the first dialect to which the voice wake-up word belongs; any method capable of identifying that dialect is applicable to the embodiments of the present application. Several ways in which terminal device 102 can identify the dialect of the voice wake-up word in some exemplary embodiments are enumerated below:
Mode 1: terminal device 102 dynamically matches the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and takes as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
In mode 1, reference wake-up words are recorded in different dialects in advance. The reference wake-up words recorded in the different dialects have the same text content as the voice wake-up word. Because users who speak different dialects produce speech differently, the acoustic features of the reference wake-up words recorded in different dialects differ. Based on this, terminal device 102 prerecords reference wake-up words in the different dialects and, after receiving the voice wake-up word input by the user, dynamically matches its acoustic features against each of the reference wake-up words to obtain a matching degree with each. The first setting requirement can vary with the application scenario. For example, the dialect corresponding to the reference wake-up word with the highest matching degree can be taken as the first dialect; alternatively, a matching-degree threshold can be set, and the dialect corresponding to a reference wake-up word whose matching degree exceeds the threshold is taken as the first dialect; or a matching-degree range can be set, and the dialect corresponding to a reference wake-up word whose matching degree falls within that range is taken as the first dialect.
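The three variants of the "first setting requirement" described above (highest match, threshold, range) can be sketched as a small selection routine. This is an illustrative sketch, not code from the patent; the function name, parameter names, and the convention that a higher matching degree means a closer match are assumptions made for the example.

```python
def select_dialect(matching_degrees, rule="best", threshold=0.8, degree_range=(0.8, 1.0)):
    """Pick the first dialect from per-dialect matching degrees (higher = closer).

    rule="best":      dialect with the highest matching degree
    rule="threshold": a dialect whose matching degree exceeds the threshold
    rule="range":     a dialect whose matching degree falls inside degree_range
    Returns None when no dialect satisfies the requirement.
    """
    if rule == "best":
        return max(matching_degrees, key=matching_degrees.get)
    if rule == "threshold":
        for dialect, degree in matching_degrees.items():
            if degree > threshold:
                return dialect
        return None
    if rule == "range":
        lo, hi = degree_range
        for dialect, degree in matching_degrees.items():
            if lo <= degree <= hi:
                return dialect
        return None
    raise ValueError("unknown rule: %s" % rule)
```

All three modes described in this document reuse this pattern; only the quantity being compared (warping path distance, acoustic-feature matching degree, or text matching degree) changes.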
In mode 1, the acoustic features can be embodied as time-domain and frequency-domain features of the speech signal. There are many matching methods based on such features; optionally, dynamic time warping (DTW) can be used to perform dynamic matching of the wake-up word's time series.
Dynamic time warping is a method for measuring the similarity between two time series. Terminal device 102 generates a time series from the input voice wake-up word and compares it with the time series of each reference wake-up word recorded in a different dialect. Between the two time series being compared, at least one pair of similar points is determined; the sum of the distances between similar points, i.e., the warping path distance, measures the similarity of the two series. Optionally, the dialect corresponding to the reference wake-up word with the smallest warping path distance to the voice wake-up word can be taken as the first dialect; a distance threshold can be set, and the dialect corresponding to a reference wake-up word whose warping path distance is below the threshold is taken as the first dialect; or a distance range can be set, and the dialect corresponding to a reference wake-up word whose warping path distance falls within that range is taken as the first dialect.
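The warping-path matching just described can be sketched as follows. This is a minimal illustration on one-dimensional feature sequences, not code from the patent; the function names, the per-frame cost |a - b|, and the "smallest distance wins" selection rule are assumptions made for the sketch.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def identify_dialect(wake_features, reference_features):
    """Return the dialect whose reference wake-up word has the smallest
    warping path distance to the input voice wake-up word."""
    return min(reference_features,
               key=lambda dialect: dtw_distance(wake_features,
                                                reference_features[dialect]))
```

Because DTW aligns sequences of different lengths, the same wake-up word spoken at different speeds can still match its reference recording closely, which suits short wake-up words well.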
Mode 2: terminal device 102 extracts the acoustic features of the voice wake-up word, matches them against the acoustic features of the different dialects, and takes as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second setting requirement.
In mode 2, the acoustic features of the different dialects are obtained in advance; the acoustic features of the voice wake-up word are then extracted, and the first dialect to which it belongs is determined by matching between acoustic features.
Optionally, before the acoustic features of the voice wake-up word are extracted, the voice wake-up word can be filtered and digitized. Filtering means retaining the components of the voice wake-up word with frequencies in the 300–3400 Hz band. Digitizing means performing A/D conversion and anti-aliasing on the retained signal.
Optionally, the acoustic features of the voice wake-up word can be identified by computing its spectral feature parameters, such as shifted delta cepstral (SDC) parameters. As in mode 1, the second setting requirement can vary with the application scenario. For example, the dialect whose acoustic features match those of the voice wake-up word with the highest matching degree can be taken as the first dialect; a matching-degree threshold can be set, and a dialect whose matching degree exceeds the threshold is taken as the first dialect; or a matching-degree range can be set, and a dialect whose matching degree falls within that range is taken as the first dialect.
Shifted delta cepstral parameters are formed by stacking several blocks of delta cepstra that span multiple frames of speech; by taking into account the influence of inter-frame delta cepstra, they incorporate more temporal information. The shifted delta cepstral parameters of the voice wake-up word are compared with those of the reference wake-up words recorded in the different dialects. Optionally, the dialect corresponding to the reference wake-up word whose shifted delta cepstral parameters match best can be taken as the first dialect; a parameter-difference threshold can be set, and the dialect corresponding to a reference wake-up word whose parameter difference is below the threshold is taken as the first dialect; or a parameter-difference range can be set, and the dialect corresponding to a reference wake-up word whose parameter difference falls within that range is taken as the first dialect.
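A simplified version of the shifted-delta-cepstra construction can be sketched as follows. This is an assumption-laden illustration, not the patent's implementation: real SDC front ends fix specific values of the delta window d, shift p, and block count k, and operate on MFCC frames rather than the toy vectors used here.

```python
def deltas(cepstra, d=1):
    """Per-frame delta cepstra: c[t+d] - c[t-d], with edge frames clamped."""
    n = len(cepstra)
    out = []
    for t in range(n):
        lo = cepstra[max(0, t - d)]
        hi = cepstra[min(n - 1, t + d)]
        out.append([h - l for h, l in zip(hi, lo)])
    return out


def shifted_delta_cepstra(cepstra, d=1, p=2, k=3):
    """Stack k delta vectors taken p frames apart, starting at each frame t.

    cepstra: list of per-frame cepstral coefficient vectors. The stacked
    vector at frame t spans frames t .. t + (k-1)*p, so it carries temporal
    information from well beyond a single frame.
    """
    dl = deltas(cepstra, d)
    n = len(cepstra)
    out = []
    for t in range(n):
        stacked = []
        for i in range(k):
            stacked.extend(dl[min(n - 1, t + i * p)])
        out.append(stacked)
    return out
```

The output dimensionality is k times the cepstral dimension, which is why SDC features are popular for dialect and language identification: they summarize several hundred milliseconds of context per frame.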
Mode 3: the voice wake-up word is converted into a text wake-up word; the text wake-up word is matched against the reference text wake-up words corresponding to the different dialects, and the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement is taken as the first dialect.
In mode 3, the text wake-up word is the text into which the voice wake-up word is converted by speech recognition, and the reference text wake-up word of each dialect is the text into which that dialect's reference wake-up word is converted by speech recognition. Optionally, the text wake-up word and the reference text wake-up words of the different dialects can all be produced with the same rough speech recognition model, to improve the efficiency of the overall speech recognition process. Alternatively, the ASR model of each dialect can be used in advance to convert that dialect's reference wake-up word into its reference text wake-up word. After the voice wake-up word is received, the ASR models corresponding to the dialects are tried in turn: the ASR model of one dialect is selected, the voice wake-up word is recognized with it to obtain a text wake-up word, and that text wake-up word is matched against the reference text wake-up word of the same dialect. If the matching degree of that dialect's reference text wake-up word with the text wake-up word meets the third setting requirement, that dialect is taken as the first dialect. Otherwise, the voice wake-up word is recognized with the ASR model of the next dialect to obtain a new text wake-up word, which is matched against that dialect's reference text wake-up word, and so on, until a reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement is obtained; the dialect corresponding to that reference text wake-up word is taken as the first dialect to which the voice wake-up word belongs.
Optionally, as in modes 1 and 2, the dialect corresponding to the reference text wake-up word with the highest matching degree can be taken as the first dialect; a matching-degree threshold can be set, and the dialect corresponding to a reference text wake-up word whose matching degree exceeds the threshold is taken as the first dialect; or a matching-degree range can be set, and the dialect corresponding to a reference text wake-up word whose matching degree falls within that range is taken as the first dialect.
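The sequential trial of per-dialect ASR models in mode 3 can be sketched as follows. This is an illustrative sketch under stated assumptions: the ASR models are stand-in functions mapping audio to text, and `difflib.SequenceMatcher.ratio` stands in for whatever text matching degree an implementation would actually use; none of these names come from the patent.

```python
import difflib


def identify_dialect_by_text(wake_audio, asr_models, reference_texts, min_ratio=0.8):
    """Try each dialect's ASR model in turn on the voice wake-up word.

    asr_models:      dialect -> function mapping audio to recognized text.
    reference_texts: dialect -> that dialect's reference text wake-up word.
    Returns the first dialect whose recognized text matches its reference
    text with at least min_ratio similarity (the "third setting
    requirement"), or None if no dialect qualifies.
    """
    for dialect, asr in asr_models.items():
        text = asr(wake_audio)
        ratio = difflib.SequenceMatcher(None, text, reference_texts[dialect]).ratio()
        if ratio >= min_ratio:
            return dialect
    return None
```

Because each dialect requires a full recognition pass over the wake-up word, this mode is typically slower than the acoustic-feature modes 1 and 2, but it reuses the same ASR models that the server already maintains.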
It is worth noting that the first setting requirement, the second setting requirement, and the third setting requirement can be the same or different.
In some exemplary embodiments, terminal device 102 is a device with a display screen, such as a mobile phone, computer, or wearable device. A voice input interface can then be shown on the display screen, and the text information and/or voice signal input by the user is obtained through the voice input interface. Optionally, when the user needs speech recognition, the user can send an opening or activation instruction to terminal device 102, for example by pressing the device's power key or touching the display screen of terminal device 102. In response to the instruction to activate or open itself, terminal device 102 shows the voice input interface to the user on the display screen. Optionally, a microphone icon or text such as "enter wake-up word" can be shown on the voice input interface to prompt the user to input the voice wake-up word. Terminal device 102 can then obtain the voice wake-up word input by the user through the voice input interface.
In some exemplary embodiments, terminal device 102 can be a device with voice playback capability, such as a mobile phone, computer, or smart speaker. Based on this, after sending the service request to server 101 and before sending the speech signal to be recognized to server 101, terminal device 102 can output a voice input prompt, such as the audio prompt "please speak" or "what would you like to hear", to prompt the user to perform voice input. After inputting the voice wake-up word, the user can, under this prompt tone, input the speech signal to be recognized to terminal device 102. Terminal device 102 receives the speech signal to be recognized input by the user and sends it to server 101, and server 101 performs speech recognition on it using the ASR model corresponding to the first dialect.
In other exemplary embodiments, terminal device 102 can be a device with a display screen, such as a mobile phone, computer, or wearable device. Based on this, after sending the service request to server 101 and before sending the speech signal to be recognized to server 101, terminal device 102 can display a voice input prompt as text or an icon, such as the text "please speak" or a microphone icon, to prompt the user to perform voice input. After inputting the voice wake-up word, the user can, under this prompt, input the speech signal to be recognized to terminal device 102. Terminal device 102 receives the speech signal to be recognized input by the user and sends it to server 101, and server 101 performs speech recognition on it using the ASR model corresponding to the first dialect.
In still other exemplary embodiments, terminal device 102 can have an indicator light. Based on this, after sending the service request to server 101 and before sending the speech signal to be recognized to server 101, terminal device 102 can light the indicator to prompt the user to perform voice input. After inputting the voice wake-up word, the user can, under the prompt of the indicator light, input the speech signal to be recognized to terminal device 102. Terminal device 102 receives the speech signal to be recognized input by the user and sends it to server 101, and server 101 performs speech recognition on it using the ASR model corresponding to the first dialect.
It is worth noting that terminal device 102 can have at least two, or all three, of voice playback capability, an indicator light, and a display screen. Based on this, terminal device 102 can output the voice input prompt simultaneously in two or three of the following ways: audibly, as text or an icon, and by lighting the indicator, to reinforce the interaction with the user.
In some exemplary embodiments, before outputting the voice input prompt tone, displaying the voice input prompt, or lighting the indicator, terminal device 102 can first determine that server 101 has selected the ASR model corresponding to the first dialect, so that once the speech signal to be recognized input by the user is sent to server 101, it can be recognized directly with the ASR model already selected at server 101. Based on this, after selecting the ASR model corresponding to the first dialect from the ASR models of the different dialects, server 101 returns a notification message to terminal device 102 indicating that the ASR model corresponding to the first dialect has been selected. Terminal device 102 receives the notification message returned by server 101 and thereby learns that server 101 has selected the ASR model corresponding to the first dialect. After receiving the notification message, terminal device 102 can output the voice input prompt tone, display the voice input prompt, or light the indicator to prompt the user to perform voice input.
In the embodiments of the present application, server 101 needs to construct the ASR models corresponding to the different dialects before selecting the ASR model corresponding to the first dialect. The process by which server 101 constructs the ASR models mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and constructing the ASR model of each dialect from its acoustic features. For the detailed process of constructing each dialect's ASR model, reference may be made to the prior art, and details are not repeated here.
Optionally, the corpora of the different dialects can be collected over a network, or can be obtained by recording the speech of large numbers of users who speak the different dialects.
Optionally, before feature extraction, the collected corpora of the different dialects can be preprocessed. Preprocessing includes applying pre-emphasis, windowing, and endpoint detection to the speech. After the corpora of the different dialects are preprocessed, features can be extracted from the speech. Speech features include time-domain features and frequency-domain features. Time-domain features include short-time average energy, short-time average zero-crossing rate, formants, and pitch period; frequency-domain features include linear prediction coefficients, LPC cepstral coefficients, line spectrum pair parameters, the short-time spectrum, and Mel-frequency cepstral coefficients.
In the following, the extraction of Mel-frequency cepstral coefficients is used to illustrate the acoustic feature extraction process. First, based on the perceptual characteristics of the human ear, several band-pass filters are arranged over the spectral range of the speech, each with a triangular or sinusoidal filtering characteristic. Energy information is then included in the feature vector obtained by filtering the corpus through the band-pass filters: the signal energy of each band-pass filter is computed, and the Mel-frequency cepstral coefficients are obtained by a discrete cosine transform.
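The pipeline just described (spectrum, triangular mel filterbank, log energies, DCT) can be sketched for a single frame as follows. This is a deliberately simplified illustration: the naive DFT, the filter count, the 300 Hz lower edge, and all function names are assumptions for the sketch; real front ends use an FFT and standardized filterbank layouts.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate=8000, n_filters=6, n_ceps=4):
    n = len(frame)
    # power spectrum via a naive DFT (illustration only; real systems use an FFT)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(s) ** 2 / n)
    # triangular filters spaced evenly on the mel scale between 300 Hz and Nyquist
    lo, hi = hz_to_mel(300.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [mel_to_hz(lo + i * (hi - lo) / (n_filters + 1))
                  for i in range(n_filters + 2)]
    bins = [int(n * f / sample_rate) for f in mel_points]
    energies = []
    for i in range(1, n_filters + 1):
        e = 0.0
        for k in range(bins[i - 1], bins[i + 1] + 1):
            if k >= len(spectrum):
                break
            if bins[i - 1] <= k <= bins[i]:
                w = (k - bins[i - 1]) / max(1, bins[i] - bins[i - 1])
            else:
                w = (bins[i + 1] - k) / max(1, bins[i + 1] - bins[i])
            e += w * spectrum[k]
        energies.append(math.log(e + 1e-10))
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    return [sum(energies[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for c in range(n_ceps)]
```

Running this per frame over a windowed, pre-emphasized signal yields the MFCC sequence that the per-dialect models are trained on.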
After the acoustic features of the different dialects are obtained, the acoustic features of each dialect are used as input, and the text corresponding to that dialect's corpus as output, to train the parameters of the dialect's initial model, yielding the ASR model of each dialect. Optionally, the ASR models include, but are not limited to, models built by vector quantization methods, neural network models, and the like.
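The "features in, text out" training setup can be made concrete with a toy stand-in model. To be clear about assumptions: a real per-dialect ASR model would be a neural network or VQ-based recognizer as the text says; the nearest-centroid classifier below merely mirrors the shape of the training interface (feature/transcript pairs in, recognizer out) and is not the patent's method.

```python
def train_dialect_model(corpus):
    """Train a stand-in "ASR model" for one dialect.

    corpus: list of (feature_vector, transcript) pairs for that dialect.
    Returns a recognize(features) -> transcript function that picks the
    transcript whose feature centroid is nearest to the input.
    """
    by_text = {}
    for features, text in corpus:
        by_text.setdefault(text, []).append(features)
    centroids = {text: [sum(col) / len(col) for col in zip(*vectors)]
                 for text, vectors in by_text.items()}

    def recognize(features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(centroids, key=lambda text: sq_dist(centroids[text]))

    return recognize
```

Training one such model per dialect and storing them in a dialect-keyed mapping is the structure the server needs in order to select "the ASR model corresponding to the first dialect".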
The above embodiments are described in detail below, taking as an example an application scenario in which multiple users speaking different dialects request songs with the same terminal device.
The terminal device with the song-on-demand function can be a smart speaker; optionally, the smart speaker has a display screen, and the preset voice wake-up word of the smart speaker is "hello". When a user who speaks the Cantonese dialect wants to request a song, the Cantonese user first touches the display screen to input an instruction to activate the smart speaker. In response to the instruction to activate the terminal device, the smart speaker shows a voice input interface on the display screen, with the text "hello" displayed on the voice input interface. The Cantonese user inputs the speech signal "hello" to the voice input interface. The smart speaker obtains the speech signal "hello" input by the user through the voice input interface and identifies that this "hello" belongs to the Cantonese dialect. It then sends a service request to the server, requesting the server to select the ASR model corresponding to the Cantonese dialect from the ASR models of the different dialects. On receiving the service request, the server selects the ASR model corresponding to the Cantonese dialect and returns a notification message to the smart speaker indicating that this model has been selected. The smart speaker then outputs a voice input prompt, such as "please input voice", to prompt the user to perform voice input. Under the prompt, the Cantonese user inputs the speech signal of the song title "Five-Starred Red Flag". The smart speaker receives the speech signal "Five-Starred Red Flag" input by the Cantonese user and sends it to the server. The server performs speech recognition on the speech signal "Five-Starred Red Flag" using the ASR model corresponding to the Cantonese dialect to obtain the text "Five-Starred Red Flag", and delivers the song matching "Five-Starred Red Flag" to the smart speaker, so that the smart speaker plays the song.
Similarly, after the Cantonese user finishes requesting songs, suppose a user who speaks the Tibetan dialect wants to request a song. The Tibetan user can input the speech signal "hello" on the voice input interface shown by the smart speaker. The smart speaker identifies that this "hello" belongs to the Tibetan dialect; it then sends a service request to the server, requesting the server to select the ASR model corresponding to the Tibetan dialect from the ASR models of the different dialects. On receiving the service request, the server selects the ASR model corresponding to the Tibetan dialect and returns a notification message to the smart speaker indicating that this model has been selected. The smart speaker then outputs a voice input prompt, such as "please input voice", to prompt the user to perform voice input. Under the prompt, the Tibetan user inputs the speech signal of the song title "My Motherland". The smart speaker receives the speech signal "My Motherland" input by the user and sends it to the server. The server performs speech recognition on the speech signal "My Motherland" using the ASR model corresponding to the Tibetan dialect to obtain the text "My Motherland", and delivers the song matching "My Motherland" to the smart speaker, so that the smart speaker plays the song.
In this application scenario, with the speech recognition method provided by the embodiments of the present application, when users speaking different dialects request songs with the same smart speaker, no manual switching of ASR models is needed: the user only needs to input the voice wake-up word in the corresponding dialect. The smart speaker can automatically identify the dialect to which the voice wake-up word belongs, request the server to activate the ASR model of that dialect, and recognize the song title requested by the user, supporting automated multi-dialect song requests while improving the efficiency of requesting songs.
Fig. 2 is a kind of flow diagram for audio recognition method that the application another exemplary embodiment provides.The implementation Example can be based on the realization of speech recognition system shown in Fig. 1, the description mainly carried out from the angle of terminal device.As shown in Fig. 2, should Method includes:
21. Receive a voice wake-up word.
22. Identify the first dialect to which the voice wake-up word belongs.
23. Send a service request to the server, requesting the server to select the ASR model corresponding to the first dialect from the ASR models of the different dialects.
24. Send the speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
When the user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device, the voice wake-up word being a speech signal with specified text content, such as "open", "Tmall Genie", or "hello". The terminal device receives the voice wake-up word input by the user and identifies the dialect to which it belongs; the dialect of the subsequent speech signal to be recognized can then be determined (namely the dialect to which the voice wake-up word belongs), providing the basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
Then, after identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server; the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models of the different dialects. Next, the terminal device sends the speech signal to be recognized to the server. On receiving the service request, the server selects the ASR model corresponding to the first dialect from the ASR models of the different dialects, and recognizes the received speech signal to be recognized with the selected ASR model corresponding to the first dialect.
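The terminal-server exchange in steps 21 to 24 can be sketched with two small classes. This is a structural sketch only: the class and method names, the in-process "notification message" string, and the stand-in recognizer functions are all assumptions; the patent describes the protocol, not this code.

```python
class DialectASRServer:
    """Server side: holds one ASR model per dialect and activates the model
    named in the service request before any audio arrives."""

    def __init__(self, models):
        self.models = models          # dialect -> recognize(audio) -> text
        self.selected = None

    def handle_service_request(self, first_dialect):
        self.selected = self.models[first_dialect]
        return "model-selected"       # notification message back to the terminal

    def handle_audio(self, speech_signal):
        if self.selected is None:
            raise RuntimeError("no ASR model selected yet")
        return self.selected(speech_signal)


class TerminalDevice:
    """Terminal side: identifies the wake-up word's dialect, requests the
    matching model, then forwards the speech signal to be recognized."""

    def __init__(self, server, identify_dialect):
        self.server = server
        self.identify_dialect = identify_dialect

    def run(self, wake_word_audio, speech_signal):
        first_dialect = self.identify_dialect(wake_word_audio)   # step 22
        ack = self.server.handle_service_request(first_dialect)  # step 23
        assert ack == "model-selected"                           # notification
        return self.server.handle_audio(speech_signal)           # step 24
```

The key design point the sketch preserves is ordering: the model-selection request precedes the audio, so the server can recognize the speech signal immediately with the already-selected model.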
In the present embodiment, the terminal device identifies the first dialect to which the voice wake-up word belongs and sends a service request to the server, so that the server can select the ASR model corresponding to the first dialect from the ASR models of the different dialects and use it to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the matching dialect is selected automatically based on the voice wake-up word, without manual operation by the user, making the system more convenient and faster to use and helping improve the efficiency of multi-dialect speech recognition. Further, because the voice wake-up word is short, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing the speech signal to be recognized.
In some exemplary embodiments, one way of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets the first setting requirement. Another way includes: matching the acoustic features of the voice wake-up word against the acoustic features of the different dialects, and taking as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets the second setting requirement. Yet another way includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against the reference text wake-up words corresponding to the different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement.
In some exemplary embodiments, one way of receiving the voice wake-up word includes: showing a voice input interface to the user in response to an instruction to activate or open the terminal device, and obtaining the voice wake-up word input by the user through the voice input interface.
In some exemplary embodiments, before sending the speech signal to be recognized to the server, the method further includes: outputting a voice input prompt to prompt the user to perform voice input, and receiving the speech signal to be recognized input by the user.
In some exemplary embodiments, before outputting the voice input prompt, the method further includes: receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
Fig. 3 is a flow diagram of another speech recognition method provided by another exemplary embodiment of the present application. This embodiment can be implemented on the basis of the speech recognition system shown in Fig. 1, and is described mainly from the perspective of the server. As shown in Fig. 3, the method includes:
31. Receive a service request sent by a terminal device, the service request instructing selection of the ASR model corresponding to a first dialect.
32. Select the ASR model corresponding to the first dialect from the ASR models of the different dialects, the first dialect being the dialect to which a voice wake-up word belongs.
33. Receive the speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
In the present embodiment, after identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server. According to the service request, the server selects the ASR model corresponding to the first dialect from the prestored ASR models of the different dialects, and can then perform speech recognition on the subsequent speech signal with the ASR model corresponding to the first dialect. This automates multi-dialect speech recognition: the ASR model of the matching dialect is selected automatically based on the voice wake-up word, without manual operation by the user, making the system more convenient and faster to use and helping improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying the dialect to which it belongs takes little time, enabling the speech recognition system to quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, the server needs to construct the ASR models corresponding to the different dialects before selecting the ASR model corresponding to the first dialect. One process for constructing the ASR models of the different dialects mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and constructing the ASR model of each dialect from its acoustic features.
In some exemplary embodiments, after performing speech recognition on the speech signal to be recognized with the ASR model corresponding to the first dialect, the server can send the speech recognition result, or information associated with the speech recognition result, to the terminal device, so that the terminal device performs subsequent processing based on the speech recognition result or its associated information.
Fig. 4 is a structural schematic diagram of another speech recognition system provided by another exemplary embodiment of the present application. As shown in Fig. 4, the speech recognition system 400 includes a server 401 and a terminal device 402 that are communicatively connected.
The architecture of the speech recognition system 400 provided in this embodiment is the same as that of the speech recognition system 100 shown in Fig. 1; the difference lies in the functions performed by the server 401 and the terminal device 402 during speech recognition. For the implementation forms and communication connection modes of the terminal device 402 and the server 401 in Fig. 4, refer to the description of the embodiment shown in Fig. 1, which is not repeated here.
Similar to the speech recognition system 100 shown in Fig. 1, in the speech recognition system 400 shown in Fig. 4 the terminal device 402 cooperates with the server 401 to provide a speech recognition function to users. Moreover, considering that in some cases the terminal device 402 may be used by multiple users who speak different dialects, the speech recognition system 400 constructs an ASR model for each dialect; through the cooperation between the terminal device 402 and the server 401, it can provide the speech recognition function to users who speak different dialects, that is, it can perform speech recognition on the voice signals of users who speak different dialects.
In the speech recognition system 400 shown in Fig. 4, the terminal device 402 also supports the voice wake-up word function, but it is mainly used to receive the voice wake-up word input by the user and report it to the server 401 so that the server 401 identifies the dialect to which the voice wake-up word belongs; in this respect it differs from the terminal device 102 in the embodiment shown in Fig. 1. Correspondingly, in the speech recognition system 400 shown in Fig. 4, in addition to providing ASR models for different dialects and using the selected ASR model to perform speech recognition on voice signals in the corresponding dialect, the server 401 also has the function of identifying the dialect to which the voice wake-up word belongs.
Based on the speech recognition system 400 shown in Fig. 4, when a user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device 402. The voice wake-up word is a voice signal with specified text content, such as "open", "Tmall Genie", or "hello". The terminal device 402 receives the voice wake-up word input by the user and sends it to the server 401. After receiving the voice wake-up word sent by the terminal device 402, the server 401 identifies the dialect to which the voice wake-up word belongs. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted as the first dialect. The first dialect refers to the dialect to which the voice wake-up word belongs, and may be, for example, Mandarin, the Shanxi dialect, or the Hunan dialect. Then, the server 401 selects, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, so that speech recognition can subsequently be performed on voice signals in the first dialect based on that model. In this embodiment, the server 401 pre-stores the ASR models corresponding to the different dialects. Optionally, each dialect may correspond to its own ASR model, or several similar dialects may share the same ASR model; no limitation is placed on this. The ASR model corresponding to the first dialect is used to convert voice signals in the first dialect into text content.
After sending the voice wake-up word to the server 401, the terminal device 402 continues to send the voice signal to be recognized to the server 401. The server 401 receives the voice signal to be recognized sent by the terminal device 402 and performs speech recognition on it using the ASR model corresponding to the first dialect. Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device 402 after inputting the voice wake-up word; in that case, before sending the voice signal to be recognized to the server 401, the terminal device 402 may also receive the voice signal to be recognized input by the user. Alternatively, the voice signal to be recognized may be a voice signal pre-recorded and stored locally on the terminal device 402.
In the present embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects; the selected ASR model is used to perform speech recognition on the subsequent voice signal to be recognized. This automates multi-dialect speech recognition: because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is required, which makes the process more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is relatively brief, the process of identifying the dialect to which it belongs takes relatively little time. The speech recognition system can therefore quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, one manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement.
In other exemplary embodiments, another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement.
In still other exemplary embodiments, yet another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against the reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
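The first manner above ("dynamic matching" of acoustic features against per-dialect reference wake-up words) can be sketched with dynamic time warping (DTW), one plausible reading of the patent's dynamic matching. This is a hedged illustration, not the patent's actual algorithm: the 1-D feature values are toy data, and the "first set requirement" is assumed to be simply the lowest DTW distance.

```python
# Sketch: identify the wake-up word's dialect by DTW against per-dialect
# reference wake-up word feature sequences.

def dtw_distance(a, b):
    """Classic dynamic time warping over 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def identify_dialect(wake_word_feats, references):
    """references: dict mapping dialect -> reference wake-word feature sequence.
    Returns the dialect whose reference matches best (lowest DTW distance)."""
    return min(references, key=lambda d: dtw_distance(wake_word_feats, references[d]))

references = {
    "mandarin": [1.0, 2.0, 3.0, 2.0],
    "shanxi":   [2.0, 4.0, 6.0, 4.0],
}
print(identify_dialect([1.1, 2.1, 2.9, 2.0], references))  # → mandarin
```

A production system would run DTW (or a classifier) over multi-dimensional acoustic feature frames rather than scalars, and would apply a distance threshold rather than always accepting the nearest reference.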
The manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs is similar to the manner in which the terminal device 102 identifies it; for details, refer to the foregoing embodiments, which are not repeated here.
In some exemplary embodiments, the manner in which the terminal device 402 receives the voice wake-up word includes: in response to an instruction to activate or turn on the terminal device, displaying a voice input interface to the user; and obtaining, via the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before sending the voice signal to be recognized to the server 401, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input, and then receive the voice signal to be recognized input by the user.
In some exemplary embodiments, before outputting the voice input prompt information, the terminal device 402 may receive a notification message returned by the server 401, the notification message indicating that the ASR model corresponding to the first dialect has been selected. Based on this, the terminal device 402 can output the voice input prompt information to the user after determining that the server 401 has selected the ASR model corresponding to the first dialect; in this way, once the voice signal to be recognized input by the user is sent to the server 401, the server 401 can directly recognize it using the selected ASR model.
In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 401 may collect corpora of the different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR model corresponding to each dialect according to its acoustic features. For the detailed process of constructing the ASR model for each dialect, refer to the prior art, which is not repeated here.
In some exemplary embodiments, the server 401 may return the speech recognition result, or information associated with the speech recognition result, to the terminal device 402. For example, the server 401 may return the recognized text content to the terminal device 402; alternatively, the server 401 may return information such as songs or videos matching the speech recognition result. The terminal device 402 receives the speech recognition result or the associated information returned by the server 401 and performs subsequent processing accordingly.
Fig. 5 is a flow diagram of another speech recognition method provided by another exemplary embodiment of the present application. The method of this embodiment can be implemented based on the speech recognition system shown in Fig. 4 and is described mainly from the perspective of the terminal device. As shown in Fig. 5, the method includes:
51. Receive a voice wake-up word.
52. Send the voice wake-up word to the server, so that the server, based on the voice wake-up word, selects from the ASR models corresponding to different dialects the ASR model corresponding to the first dialect to which the voice wake-up word belongs.
53. Send the voice signal to be recognized to the server, so that the server performs speech recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
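The terminal-side steps above amount to a thin client that first reports the voice wake-up word and then sends the signal to be recognized. The sketch below uses a plain in-process function call as a stand-in for the network transport, and every class and method name is hypothetical; it only illustrates the ordering of steps 52 and 53.

```python
# Sketch: terminal-side flow — report wake word, then stream speech.

class TerminalClient:
    def __init__(self, server):
        self.server = server

    def run(self, wake_word_audio, speech_audio):
        # Step 52: send the voice wake-up word so the server can pick a model.
        self.server.handle_wake_word(wake_word_audio)
        # Step 53: send the voice signal to be recognized.
        return self.server.handle_speech(speech_audio)

class FakeServer:
    """Minimal server stub: remembers which 'dialect model' it selected."""
    def handle_wake_word(self, audio):
        # Pretend identification: the stub reads a dialect tag off the audio.
        self.model = audio["dialect"]

    def handle_speech(self, audio):
        return "recognized with %s model: %s" % (self.model, audio["content"])

client = TerminalClient(FakeServer())
result = client.run({"dialect": "hunan"}, {"content": "play music"})
print(result)  # → recognized with hunan model: play music
```

In a real deployment the two calls would be separate network requests, and the server would hold the selected model in per-session state between them.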
When a user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device. The voice wake-up word is a voice signal with specified text content, such as "open", "Tmall Genie", or "hello". The terminal device receives the voice wake-up word input by the user and sends it to the server, so that the server identifies the dialect to which the voice wake-up word belongs. The dialect to which the subsequent voice signal to be recognized belongs (i.e., the dialect of the voice wake-up word) can thereby be determined, laying the foundation for performing speech recognition with the ASR model corresponding to that dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted as the first dialect.
Then, according to the first dialect to which the voice wake-up word belongs, the server selects from the ASR models corresponding to different dialects the ASR model corresponding to the first dialect. The terminal device then continues to send the voice signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
In the present embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects; the selected ASR model is used to perform speech recognition on the subsequent voice signal to be recognized. This automates multi-dialect speech recognition: because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is required, which makes the process more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, receiving the voice wake-up word includes: in response to an instruction to activate or turn on the terminal device, displaying a voice input interface to the user; and obtaining, via the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before the voice signal to be recognized is sent to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the voice signal to be recognized input by the user.
In some exemplary embodiments, before the voice input prompt information is output, the method further includes: receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
Fig. 6 is a flow diagram of another speech recognition method provided by another exemplary embodiment of the present application. The method of this embodiment can be implemented based on the speech recognition system shown in Fig. 4 and is described mainly from the perspective of the server. As shown in Fig. 6, the method includes:
61. Receive the voice wake-up word sent by the terminal device.
62. Identify the first dialect to which the voice wake-up word belongs.
63. Select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect.
64. Receive the voice signal to be recognized sent by the terminal device, and perform speech recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
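The four server-side steps can be sketched as a dispatcher holding a pre-stored registry of per-dialect ASR models, with step 63 reduced to a dictionary lookup. The identification function and the recognizer callables below are placeholders for real models, and all names are illustrative assumptions rather than the patent's implementation.

```python
# Sketch: server-side dispatch — identify dialect from wake word, select the
# matching ASR model, then recognize subsequent speech with it.

class DialectASRServer:
    def __init__(self, models, identify):
        self.models = models        # dialect name -> recognizer callable
        self.identify = identify    # wake-word audio -> dialect name

    def on_wake_word(self, wake_word_audio):
        # Steps 61-63: receive wake word, identify dialect, select model.
        dialect = self.identify(wake_word_audio)
        if dialect not in self.models:
            raise KeyError("no ASR model stored for dialect %r" % dialect)
        self.selected = self.models[dialect]
        return dialect

    def on_speech(self, audio):
        # Step 64: recognize with the previously selected model.
        return self.selected(audio)

models = {
    "minnan":   lambda audio: "[minnan] " + audio,
    "mandarin": lambda audio: "[mandarin] " + audio,
}
# Stub identifier: pretends the dialect tag is encoded in the wake-word string.
server = DialectASRServer(models, identify=lambda w: w.split(":")[0])
server.on_wake_word("minnan:tmall genie")
print(server.on_speech("open the lights"))  # → [minnan] open the lights
```

Keeping the selected model in per-session server state is what lets the later `on_speech` call run without re-identifying the dialect, which mirrors the notification-then-recognize flow described above.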
The server receives the voice wake-up word sent by the terminal device and identifies the dialect to which it belongs. The dialect to which the subsequent voice signal to be recognized belongs (i.e., the dialect of the voice wake-up word) can thereby be determined, laying the foundation for performing speech recognition with the ASR model corresponding to that dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted as the first dialect.
Then, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects, and can then perform speech recognition on subsequent voice signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without any manual operation by the user, which makes the process more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is relatively brief, the process of identifying the dialect to which it belongs takes relatively little time. The speech recognition system can therefore quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement.
In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement.
In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against the reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
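The third manner (text-based matching) can be sketched with edit distance as one plausible "matching degree", and the "third set requirement" taken to be the smallest distance. Both choices, and the reference wake-up word table, are illustrative assumptions, not details given in the patent; the transcription of the wake-up word into text is assumed to happen upstream.

```python
# Sketch: identify the dialect by matching the transcribed wake-up word
# against each dialect's reference text wake-up word.

def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def identify_by_text(text_wake_word, reference_texts):
    """reference_texts: dict mapping dialect -> reference text wake-up word.
    Returns the dialect whose reference text is closest in edit distance."""
    return min(reference_texts,
               key=lambda d: edit_distance(text_wake_word, reference_texts[d]))

references = {"mandarin": "ni hao", "cantonese": "nei hou"}
print(identify_by_text("ni haoo", references))  # → mandarin
```

A threshold on the distance (rather than always taking the minimum) would be needed in practice to reject wake-up words that match no dialect well.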
In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of the different dialects; performing feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and constructing the ASR model corresponding to each dialect according to its acoustic features.
In some exemplary embodiments, the server may return the speech recognition result, or information associated with the speech recognition result, to the terminal device. For example, the server may return the recognized text content to the terminal device; alternatively, it may return information such as songs or videos matching the speech recognition result.
In the above embodiments, multi-dialect speech recognition is performed jointly by the terminal device and the server, but the present application is not limited thereto. For example, if the processing and storage capabilities of the terminal device or the server are powerful enough, the multi-dialect speech recognition function can be integrated on the terminal device or the server alone. Based on this, another exemplary embodiment of the present application provides a speech recognition method implemented independently by a server or a terminal device. For brevity, in the following embodiments the server and the terminal device are collectively referred to as the electronic device. As shown in Fig. 7, the speech recognition method implemented independently by the server or the terminal device includes the following steps:
71. Receive a voice wake-up word.
72. Identify the first dialect to which the voice wake-up word belongs.
73. Select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect.
74. Perform speech recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
When a user wants to perform speech recognition, the user may input a voice wake-up word to the electronic device. The voice wake-up word is a voice signal with specified text content, such as "open", "Tmall Genie", or "hello". The electronic device receives the voice wake-up word input by the user and identifies the first dialect to which it belongs. The first dialect refers to the dialect to which the voice wake-up word belongs, for example Mandarin, the Shanxi dialect, or the Hunan dialect.
Then, the electronic device selects, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, so that speech recognition can be performed on subsequent voice signals to be recognized based on that model. In this embodiment, the electronic device pre-stores the ASR models corresponding to the different dialects. Optionally, each dialect may correspond to its own ASR model, or several similar dialects may share the same ASR model; no limitation is placed on this. The ASR model corresponding to the first dialect is used to convert voice signals in the first dialect into text content.
After selecting the ASR model corresponding to the first dialect, the electronic device can perform speech recognition on the voice signal to be recognized using that model. Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the electronic device after inputting the voice wake-up word; in that case, before performing speech recognition using the ASR model corresponding to the first dialect, the electronic device may also receive the voice signal to be recognized input by the user. Alternatively, the voice signal to be recognized may be a voice signal pre-recorded and stored locally on the electronic device; in that case, the electronic device can obtain it directly from local storage.
In the present embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects; the selected ASR model is used to perform speech recognition on the subsequent voice signal to be recognized. This automates multi-dialect speech recognition: because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is required, which makes the process more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is relatively brief, the process of identifying the dialect to which it belongs takes relatively little time. The speech recognition system can therefore quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement.
In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement.
In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against the reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In some exemplary embodiments, receiving the voice wake-up word includes: in response to an instruction to activate or turn on the terminal device, displaying a voice input interface to the user; and obtaining, via the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before speech recognition is performed on the voice signal to be recognized using the ASR model corresponding to the first dialect, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the voice signal to be recognized input by the user.
In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of the different dialects; performing feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and constructing the ASR model corresponding to each dialect according to its acoustic features.
In some exemplary embodiments, after performing speech recognition on the voice signal to be recognized based on the ASR model corresponding to the first dialect, the electronic device may perform subsequent processing based on the speech recognition result or information associated with the speech recognition result.
It is worth noting that, in the above or following embodiments of the present application, the voice wake-up word may be preset, or the user may be allowed to customize it. Customizing or presetting the wake-up word here mainly refers to its content and/or tone. The function of customizing the voice wake-up word may be implemented by the terminal device or by the server. Optionally, this function may be provided by whichever device identifies the dialect to which the voice wake-up word belongs.
Taking the case where the terminal device provides the wake-up word customization function as an example, the terminal device may provide the user with an entrance for customizing the wake-up word. The entrance may be implemented as a physical button; the user can click the physical button to trigger a wake-up word customization operation. Alternatively, the entrance may be a wake-up word customization sub-item in the settings options of the terminal device; the user can enter the settings options and then click, hover over, or long-press the sub-item to trigger the wake-up word customization operation. Regardless of how the user triggers the operation, the terminal device may, in response to the wake-up word customization operation, receive the custom voice signal input by the user and save the received custom voice signal as the voice wake-up word. Optionally, the terminal device may display an audio recording page to the user to record the custom voice signal uttered by the user. For example, after the user triggers the wake-up word customization operation, the terminal device displays the audio recording page; the user may then input the voice signal "hello", and after receiving it the terminal device sets the voice signal "hello" as the voice wake-up word. Optionally, the terminal device may maintain a wake-up dictionary and save the user's custom voice wake-up word into it.
Optionally, the voice wake-up word should not be too long, so as to reduce the difficulty of identifying the dialect to which it belongs, but it should not be too short either. If the voice wake-up word is too short, its distinctiveness is low, which easily causes false wake-ups. For example, the voice wake-up word may be between 3 and 5 characters long, but is not limited thereto. Here, one character may be one Chinese character or one English letter.
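The customization flow with the length constraint above can be sketched as a small validator that accepts a custom wake-up word only if its text form is 3 to 5 characters (counting a Chinese character or an English letter as one character each) and then saves it into the wake-up dictionary. The function names and the exact bounds being hard limits are illustrative assumptions.

```python
# Sketch: validate a custom wake-up word's length, then register it
# in the wake-up dictionary maintained by the terminal device.

MIN_LEN, MAX_LEN = 3, 5  # characters, per the example range above

def register_wake_word(wake_dictionary, text):
    """Validate the custom wake-up word and save it into the dictionary."""
    length = len(text)  # Python counts Chinese characters and letters alike
    if not MIN_LEN <= length <= MAX_LEN:
        raise ValueError("wake-up word must be %d-%d characters, got %d"
                         % (MIN_LEN, MAX_LEN, length))
    wake_dictionary.add(text)
    return text

wake_dictionary = set()
register_wake_word(wake_dictionary, "hello")    # 5 English letters: accepted
register_wake_word(wake_dictionary, "天猫精灵")  # 4 Chinese characters: accepted
print(sorted(wake_dictionary))
```

A word such as "hi" (2 characters) would be rejected here, matching the observation that overly short wake-up words are easy to falsely trigger.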
Optionally, when customizing the wake-up word, easily distinguishable words may be chosen, and very common words should be avoided, so as to reduce the probability of the application being falsely woken up.
In other embodiments of the present application, the voice wake-up word is mainly used to wake up or activate the speech recognition function of the application, without limiting the dialect to which the voice wake-up word belongs; that is, the user may utter the voice wake-up word in any dialect or in Mandarin. After uttering the voice wake-up word, the user may then utter a voice signal with dialect-indicative meaning, for example a voice signal whose content is "Tianjin dialect", "Henan dialect", or "enable the Minnan dialect". The dialect in which speech recognition needs to be performed can then be parsed from this dialect-indicative voice signal uttered by the user; the ASR model corresponding to the parsed dialect can be selected from the ASR models corresponding to different dialects, and speech recognition can be performed on subsequent voice signals to be recognized based on the selected ASR model. For ease of distinction and description, the voice signal with dialect-indicative meaning is referred to as the first voice signal, and the dialect parsed from the first voice signal is referred to as the first dialect.
Any voice signal with dialect-indicative meaning can serve as the first voice signal in the embodiments of the present application. For example, the first voice signal may be a voice signal uttered by the user in the first dialect, so that the first dialect can be identified based on the acoustic features of the first voice signal. Alternatively, the first voice signal may be a voice signal containing the name of the first dialect; for example, in the voice signal "please enable the Minnan dialect model", "Minnan dialect" is the name of the first dialect. Based on this, the phoneme segment corresponding to the name of the first dialect can be extracted from the first voice signal, and the first dialect can thereby be identified.
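One plausible realization of the second case (the first voice signal contains the dialect's name) is to transcribe the signal (assumed done upstream) and scan the transcript for a known dialect name. The name table and helper below are hypothetical illustrations, not the patent's implementation, which works at the phoneme-segment level.

```python
# Sketch: parse the first dialect from the transcript of a
# dialect-indicative first voice signal.

DIALECT_NAMES = {
    "tianjin dialect": "tianjin",
    "henan dialect":   "henan",
    "minnan dialect":  "minnan",
}

def parse_dialect(transcript):
    """Return the dialect named in the transcript, or None if none is found."""
    lowered = transcript.lower()
    for name, dialect in DIALECT_NAMES.items():
        if name in lowered:
            return dialect
    return None

print(parse_dialect("please enable the Minnan dialect model"))  # → minnan
print(parse_dialect("play some music"))                         # → None
```

Returning `None` when no dialect name is found matters for the fallback behavior described later, where identification falls back to the dialect of the wake-up word itself.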
The above speech recognition method combining the voice wake-up word and the first voice signal may be implemented by the terminal device and the server in cooperation, or independently by the terminal device or the server. The different implementations are described below:
Mode A: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server in cooperation. In Mode A, the terminal device supports a voice wake-up function. When the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device to wake up the speech recognition function. The terminal device receives the voice wake-up word and wakes up the speech recognition function. The user then inputs to the terminal device a first voice signal with dialect-indicative meaning. After receiving the first voice signal input by the user, the terminal device parses from it the first dialect for which speech recognition is required, i.e., the dialect to which subsequent voice signals to be recognized belong, thereby providing the basis for performing speech recognition with the ASR model corresponding to that dialect.

After parsing the first dialect from the first voice signal, the terminal device sends a service request to the server. The service request instructs the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect. After receiving the service request sent by the terminal device, the server selects the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects as instructed by the service request, so that subsequent voice signals to be recognized can be recognized with the ASR model corresponding to the first dialect.

After sending the service request to the server, the terminal device continues to send the voice signal to be recognized to the server; this voice signal belongs to the first dialect. The server receives the voice signal to be recognized sent by the terminal device and performs speech recognition on it with the selected ASR model corresponding to the first dialect. Performing speech recognition on the voice signal to be recognized with a matching ASR model helps improve the accuracy of speech recognition.
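The Mode A interaction above can be sketched as a minimal exchange between a terminal and a server. The class names, the `parse_dialect` helper, and the in-memory model registry below are illustrative assumptions for the sketch, not part of the embodiment; stub callables stand in for real ASR models:

```python
# Minimal sketch of the Mode A flow: the terminal parses the dialect from the
# first voice signal; the server selects the matching ASR model on request.
# All names here (Server, Terminal, parse_dialect) are illustrative assumptions.

class Server:
    def __init__(self, asr_models):
        self.asr_models = asr_models      # dialect name -> ASR model (stub: a callable)
        self.selected = None

    def handle_service_request(self, dialect):
        # Select the ASR model corresponding to the first dialect.
        self.selected = self.asr_models[dialect]

    def recognize(self, signal):
        # Recognize subsequent voice signals with the selected model.
        return self.selected(signal)

class Terminal:
    def __init__(self, server):
        self.server = server

    def on_first_voice_signal(self, signal):
        dialect = parse_dialect(signal)   # in the embodiment: phoneme-segment matching
        self.server.handle_service_request(dialect)

    def send_to_recognize(self, signal):
        return self.server.recognize(signal)

def parse_dialect(signal):
    # Stub standing in for phoneme-sequence matching of dialect names.
    for name in ("Minnan", "Henan"):
        if name.lower() in signal.lower():
            return name
    return None

# Usage: stub ASR "models" that just tag their dialect.
server = Server({"Minnan": lambda s: ("Minnan", s), "Henan": lambda s: ("Henan", s)})
terminal = Terminal(server)
terminal.on_first_voice_signal("please enable the Minnan dialect model")
print(terminal.send_to_recognize("hello"))  # -> ('Minnan', 'hello')
```

The design point the sketch illustrates is that dialect parsing happens once, on the first voice signal, and every later signal rides on the already-selected model.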
Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; in this case, the terminal device may receive the voice signal to be recognized from the user before sending it to the server. Alternatively, the voice signal to be recognized may be a voice signal recorded in advance and stored locally on the terminal device.

In some exemplary embodiments, the voice wake-up word mainly serves to wake up the speech recognition function of the terminal device, while the first dialect for which speech recognition is subsequently required is indicated by the first voice signal. On this basis, there is no need to restrict the language in which the user utters the voice wake-up word. For example, the user may utter the voice wake-up word in Mandarin, in the first dialect, or in another dialect different from the first dialect.

However, the same user is likely to speak to the terminal device in the same language throughout use of the device. That is, the user may input the voice wake-up word and the first voice signal in the same dialect. For such application scenarios, after receiving the first voice signal input by the user, the terminal device may preferentially parse the first dialect from the first voice signal; if the first dialect cannot be parsed from the first voice signal, the dialect to which the voice wake-up word belongs may be identified and used as the first dialect. The specific implementation of identifying the dialect to which the voice wake-up word belongs is the same as in the foregoing embodiments, and is not repeated here.
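The fallback order described above (parse the first voice signal first, fall back to the wake-up word's dialect) can be sketched as follows; the function and helper names are assumptions for illustration:

```python
def resolve_first_dialect(first_signal, wake_word, parse_dialect, identify_wake_word_dialect):
    # Prefer the dialect parsed from the first voice signal; fall back to the
    # dialect the wake-up word was spoken in when parsing fails.
    dialect = parse_dialect(first_signal)
    if dialect is None:
        dialect = identify_wake_word_dialect(wake_word)
    return dialect

# Usage with stub identifiers (illustrative only):
parse = lambda s: "Minnan" if "Minnan" in s else None
wake = lambda w: "Henan"
print(resolve_first_dialect("open the Minnan model", "ni hao", parse, wake))  # Minnan
print(resolve_first_dialect("turn on the light", "ni hao", parse, wake))      # Henan
```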
Mode B: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server in cooperation. In Mode B, the terminal device mainly receives the voice wake-up word and the first voice signal input by the user and reports them to the server, so that the server parses the first dialect from the first voice signal; in this respect it differs from the terminal device in Mode A. Correspondingly, besides providing ASR models for different dialects and performing speech recognition on voice signals in a given dialect with the selected corresponding ASR model, the server also has the function of parsing the first dialect from the first voice signal.

In Mode B, when the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device. The terminal device receives the voice wake-up word input by the user and sends it to the server. The server wakes up its own speech recognition function based on the voice wake-up word. After inputting the voice wake-up word, the user may continue to send the first voice signal to the terminal device. The terminal device sends the received first voice signal to the server. The server parses the first dialect from the first voice signal and selects, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, so that speech recognition can subsequently be performed on voice signals in the first dialect with that model.

After sending the first voice signal to the server, the terminal device continues to send the voice signal to be recognized to the server. Having selected the ASR model corresponding to the first dialect, the server can perform speech recognition on the voice signal to be recognized with that model. Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; in this case, the terminal device may receive it from the user before sending it to the server. Alternatively, the voice signal to be recognized may be a voice signal recorded in advance and stored locally on the terminal device.
In some exemplary embodiments, before the server selects the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs as the first dialect.

In some exemplary embodiments, the server parsing from the first voice signal the first dialect for which speech recognition is required includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names stored in memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Mode C: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented independently by the terminal device or the server. In Mode C, when the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device or the server. The terminal device or the server wakes up the speech recognition function according to the voice wake-up word input by the user. After inputting the voice wake-up word, the user may continue to input to the terminal device or the server a first voice signal with dialect-indicative meaning. The terminal device or the server parses the first dialect from the first voice signal and selects, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect.

Having selected the ASR model corresponding to the first dialect, the terminal device or the server can perform speech recognition on the voice signal to be recognized with that model. Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device or the server after inputting the first voice signal; in this case, before performing speech recognition on it with the ASR model corresponding to the first dialect, the terminal device or the server may receive the voice signal to be recognized input by the user. Alternatively, the voice signal to be recognized may be a voice signal recorded in advance and stored locally on the terminal device or the server; in this case, the terminal device or the server can obtain it directly from local storage.

In some exemplary embodiments, before the terminal device or the server selects the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs as the first dialect.

In some exemplary embodiments, the terminal device or the server parsing from the first voice signal the first dialect for which speech recognition is required includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names stored in memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Optionally, in Mode A, Mode B and Mode C above, parsing from the first voice signal the first dialect for which speech recognition is required includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.

Before the first voice signal is converted into the first phoneme sequence based on the acoustic model, the first voice signal needs to undergo preprocessing and feature extraction. The preprocessing includes pre-emphasis, windowed framing, and endpoint detection. Feature extraction extracts acoustic features such as time-domain or frequency-domain features from the preprocessed first voice signal.
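The three preprocessing steps named above can be sketched in plain Python. The filter coefficient, frame sizes, and energy threshold are illustrative assumptions; a real front end would tune these and typically operate on 16-bit PCM:

```python
import math

def pre_emphasis(samples, alpha=0.97):
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high frequencies.
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_and_window(samples, frame_len=400, hop=160):
    # Windowed framing: split into overlapping frames, apply a Hamming window.
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1)))
                    for i, s in enumerate(frame)]
        frames.append(windowed)
    return frames

def endpoint_detect(frames, energy_threshold=1e-3):
    # Endpoint detection (crude): keep frames whose short-time energy
    # exceeds a threshold, discarding leading/trailing silence.
    return [f for f in frames if sum(s * s for s in f) / len(f) > energy_threshold]

# Usage on a toy signal: 0.1 s of silence followed by a 440 Hz tone at 16 kHz.
sr = 16000
signal = [0.0] * 1600 + [math.sin(2 * math.pi * 440 * t / sr) for t in range(3200)]
frames = frame_and_window(pre_emphasis(signal))
speech = endpoint_detect(frames)
print(len(frames), len(speech))  # fewer frames out than in: silence is dropped
```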
The acoustic model can convert the acoustic features of the first voice signal into a phoneme sequence. A phoneme is the basic unit of word pronunciation or Chinese character pronunciation. The phonemes making up English word pronunciation may be the 39 phonemes defined by Carnegie Mellon University; the phonemes making up Chinese character pronunciation may be the complete set of initials and finals. The acoustic model includes, but is not limited to, neural-network-based deep learning models, hidden Markov models, and the like. The manner of converting acoustic features into a phoneme sequence belongs to the prior art and is not described again here.

After converting the first voice signal into the first phoneme sequence, the terminal device or the server matches the phoneme segments corresponding to different dialect names against the first phoneme sequence. The phoneme segments of different dialect names may be stored in advance, such as the phoneme segment of the dialect name "Henan dialect", the phoneme segment of the dialect name "Minnan dialect", the phoneme segment of the dialect name "British English", and so on. If a dialect name is an English word, its phoneme segment is a segment composed of several of the 39 phonemes defined by Carnegie Mellon University. If a dialect name consists of Chinese characters, its phoneme segment is a segment composed of the initials and finals of the dialect name. The first phoneme sequence is compared with the pre-stored phoneme segments corresponding to the different dialect names to determine whether the first phoneme sequence contains a phoneme segment identical or similar to the phoneme segment of some dialect name. Optionally, the similarity between each phoneme segment in the first phoneme sequence and the phoneme segments of the different dialect names may be computed; from the phoneme segments of the different dialect names, a phoneme segment whose similarity to some phoneme segment in the first phoneme sequence meets a preset similarity requirement is selected as the matched phoneme segment. The dialect corresponding to the matched phoneme segment is then taken as the first dialect.
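The matching described above can be sketched as a sliding-window comparison over the phoneme sequence. The initials/finals transcriptions and the 0.8 similarity threshold below are illustrative assumptions, not the patent's actual stored segments or preset requirement:

```python
def match_dialect(phoneme_seq, dialect_segments, min_similarity=0.8):
    # Slide each stored dialect-name phoneme segment over the first phoneme
    # sequence and score the fraction of positions that agree.
    best = (None, 0.0)
    for dialect, segment in dialect_segments.items():
        k = len(segment)
        for i in range(len(phoneme_seq) - k + 1):
            window = phoneme_seq[i:i + k]
            sim = sum(a == b for a, b in zip(window, segment)) / k
            if sim > best[1]:
                best = (dialect, sim)
    dialect, sim = best
    return dialect if sim >= min_similarity else None

# Usage: pinyin-style initials/finals serve as the phoneme inventory here
# (illustrative transcriptions only).
segments = {
    "Minnan": ["m", "in", "n", "an", "h", "ua"],
    "Henan":  ["h", "e", "n", "an", "h", "ua"],
}
# Phoneme sequence for an utterance like "qing kai qi min nan hua mo xing"
seq = ["q", "ing", "k", "ai", "q", "i", "m", "in", "n", "an", "h", "ua", "m", "o", "x", "ing"]
print(match_dialect(seq, segments))  # -> Minnan
```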
It is worth noting that some steps or content in Mode A, Mode B and Mode C above are the same as or similar to some steps or content in the embodiments shown in Fig. 1 to Fig. 7; for such content, reference may be made to the descriptions of those embodiments, which are not repeated here.

In addition, some of the flows described in the above embodiments and drawings contain multiple operations appearing in a particular order, but it should be clearly understood that these operations need not be executed in the order in which they appear herein and may be executed in parallel. Operation serial numbers such as 201 and 202 are only used to distinguish different operations; the serial numbers themselves do not imply any execution order. Moreover, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. It should also be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they neither imply a sequence nor require "first" and "second" to be of different types.
Fig. 8 is a schematic diagram of the module structure of a speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in Fig. 8, the speech recognition apparatus 800 includes a receiving module 801, an identification module 802, a first sending module 803 and a second sending module 804.

The receiving module 801 is configured to receive a voice wake-up word.

The identification module 802 is configured to identify the first dialect to which the voice wake-up word received by the receiving module 801 belongs.

The first sending module 803 is configured to send a service request to the server to request the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect.

The second sending module 804 is configured to send the voice signal to be recognized to the server, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, the identification module 802 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement as the first dialect.
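The first option above, dynamic matching of acoustic features against recorded reference wake-up words, is commonly realized with dynamic time warping (DTW). The following is a minimal sketch under that assumption, using 1-D per-frame features and invented contour values; a real system would compare multidimensional features such as MFCC vectors:

```python
def dtw_distance(a, b):
    # Classic dynamic time warping over two feature sequences (here 1-D
    # per-frame features; real systems would use e.g. MFCC vectors).
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def identify_wake_word_dialect(features, references):
    # Pick the dialect whose reference wake-up word is closest under DTW.
    return min(references, key=lambda dialect: dtw_distance(features, references[dialect]))

# Usage with toy per-frame energy contours (illustrative values):
references = {
    "Minnan": [0.1, 0.5, 0.9, 0.4, 0.1],
    "Henan":  [0.2, 0.2, 0.3, 0.8, 0.9],
}
observed = [0.1, 0.4, 0.9, 0.9, 0.3, 0.1]   # time-stretched Minnan-like contour
print(identify_wake_word_dialect(observed, references))  # -> Minnan
```

DTW suits this step because the same wake-up word spoken at different speeds yields sequences of different lengths, which a fixed frame-by-frame comparison cannot align.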
In an optional embodiment, when receiving the voice wake-up word, the receiving module 801 is specifically configured to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user; and obtain the voice wake-up word input by the user based on the voice input interface.

In an optional embodiment, before sending the voice signal to be recognized to the server, the second sending module 804 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.

In an optional embodiment, before outputting the voice input prompt information, the second sending module 804 is further configured to: receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.

In an optional embodiment, before receiving the voice wake-up word, the receiving module 801 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user; and save the custom voice signal as the voice wake-up word. The internal functions and structure of the speech recognition apparatus 800 are described above. As shown in Fig. 9, in practice, the apparatus may be implemented as a terminal device, including: a memory 901, a processor 902 and a communication component 903.
The memory 901 is configured to store a computer program, and may also store various other data to support operations on the terminal device. Examples of such data include instructions of any application or method operated on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.

The memory 901 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The processor 902 is coupled with the memory 901 and configured to execute the computer program in the memory 901 in order to: receive the voice wake-up word through the communication component 903; identify the first dialect to which the voice wake-up word belongs; send a service request to the server through the communication component 903 to request the server to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and send the voice signal to be recognized to the server through the communication component 903, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.

The communication component 903 is configured to receive the voice wake-up word and to send the service request and the voice signal to be recognized to the server.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, the processor 902 is specifically configured to:

perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement as the first dialect.
In an optional embodiment, as shown in Fig. 9, the terminal device further includes a display screen 904. On this basis, when receiving the voice wake-up word, the processor 902 is specifically configured to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user through the display screen 904; and obtain the voice wake-up word input by the user based on the voice input interface.

In an optional embodiment, the terminal device further includes an audio component 906. On this basis, before sending the voice signal to be recognized to the server, the processor 902 is further configured to: output voice input prompt information through the audio component 906 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user through the audio component 906. Correspondingly, the audio component 906 is also configured to output the voice input prompt information and receive the voice signal to be recognized input by the user.

In an optional embodiment, the processor 902 is further configured to: before outputting the voice input prompt information, receive through the communication component 903 a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.

In an optional embodiment, before receiving the voice wake-up word, the processor 902 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user through the communication component 903; and save the custom voice signal as the voice wake-up word.

Further, as shown in Fig. 9, the terminal device further includes other components such as a power supply component 905.

Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps executable by the terminal device in the above method embodiments.
Fig. 10 is a schematic diagram of the module structure of another speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in Fig. 10, the speech recognition apparatus 1000 includes a first receiving module 1001, a selecting module 1002, a second receiving module 1003 and an identification module 1004.

The first receiving module 1001 is configured to receive a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect.

The selecting module 1002 is configured to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs.

The second receiving module 1003 is configured to receive the voice signal to be recognized sent by the terminal device.

The identification module 1004 is configured to perform speech recognition, with the ASR model corresponding to the first dialect, on the voice signal to be recognized received by the second receiving module 1003.

In an optional embodiment, the speech recognition apparatus 1000 further includes a building module, configured to, before the ASR model corresponding to the first dialect is selected from among the ASR models corresponding to different dialects: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and build the ASR models corresponding to the different dialects according to the acoustic features of the different dialects.
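The building module's three steps (collect corpora, extract features, build one model per dialect) can be sketched as a pipeline. A trivial per-dialect feature centroid stands in for a real ASR acoustic model, and all data values are invented for illustration:

```python
# Sketch of the building module: corpora per dialect -> acoustic features
# -> one "model" per dialect. The nearest-mean centroid below is only a
# stand-in for a real acoustic model such as an HMM or neural network.

def extract_features(utterance):
    # Stand-in acoustic feature: mean and peak of the (toy) sample values.
    return (sum(utterance) / len(utterance), max(utterance))

def build_models(corpora):
    # corpora: dialect name -> list of utterances (lists of sample values)
    models = {}
    for dialect, utterances in corpora.items():
        feats = [extract_features(u) for u in utterances]
        # "Model" = centroid of the dialect's feature vectors.
        models[dialect] = tuple(sum(f[i] for f in feats) / len(feats) for i in range(2))
    return models

def select_model(models, dialect):
    # The selecting module's job: pick the model for the identified dialect.
    return models[dialect]

corpora = {
    "Minnan": [[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]],
    "Henan":  [[0.7, 0.8, 0.9], [0.6, 0.9, 0.9]],
}
models = build_models(corpora)
print(select_model(models, "Minnan"))
```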
The internal functions and structure of the speech recognition apparatus 1000 are described above. As shown in Fig. 11, in practice, the speech recognition apparatus 1000 may be implemented as a server, including: a memory 1101, a processor 1102 and a communication component 1103.

The memory 1101 is configured to store a computer program, and may also store various other data to support operations on the server. Examples of such data include instructions of any application or method operated on the server, contact data, phone book data, messages, pictures, videos, and the like.

The memory 1101 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.

The processor 1102 is coupled with the memory 1101 and configured to execute the computer program in the memory 1101 in order to: receive through the communication component 1103 a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect; select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and receive through the communication component 1103 the voice signal to be recognized sent by the terminal device and perform speech recognition on it with the ASR model corresponding to the first dialect.

The communication component 1103 is configured to receive the service request and the voice signal to be recognized.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, the processor 1102 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and build the ASR models corresponding to the different dialects according to the acoustic features of the different dialects.

Further, as shown in Fig. 11, the server further includes an audio component 1106. On this basis, the processor 1102 is further configured to: receive through the audio component 1106 the voice signal to be recognized sent by the terminal device.

Optionally, as shown in Fig. 11, the server further includes other components such as a display screen 1104 and a power supply component 1105.

Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps executable by the server in the above method embodiments.
In this embodiment, ASR models are built for different dialects. In the speech recognition process, the dialect to which the voice wake-up word belongs is identified in advance, and the ASR model corresponding to that dialect is then selected from among the ASR models corresponding to different dialects; speech recognition is performed on subsequent voice signals to be recognized with the selected ASR model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, making implementation more convenient and quick and helping improve the efficiency of multi-dialect speech recognition.

Further, since the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes relatively little time, enabling the speech recognition system to quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
Fig. 12 is a schematic diagram of the module structure of another speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in Fig. 12, the speech recognition apparatus 1200 includes a receiving module 1201, a first sending module 1202 and a second sending module 1203.

The receiving module 1201 is configured to receive a voice wake-up word.

The first sending module 1202 is configured to send the voice wake-up word received by the receiving module 1201 to the server, so that the server, based on the voice wake-up word, selects from among the ASR models corresponding to different dialects the ASR model corresponding to the first dialect to which the voice wake-up word belongs.

The second sending module 1203 is configured to send the voice signal to be recognized to the server, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.

In an optional embodiment, when receiving the voice wake-up word, the receiving module 1201 is specifically configured to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user; and obtain the voice wake-up word input by the user based on the voice input interface.

In an optional embodiment, before sending the voice signal to be recognized to the server, the second sending module 1203 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.

In an optional embodiment, before outputting the voice input prompt information, the second sending module 1203 is further configured to: receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.

In an optional embodiment, before receiving the voice wake-up word, the receiving module 1201 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user. The first sending module 1202 is also configured to upload the custom voice signal to the server.
The internal functions and structure of the speech recognition apparatus 1200 are described above. As shown in Fig. 13, in practice, the speech recognition apparatus 1200 may be implemented as a terminal device, including: a memory 1301, a processor 1302 and a communication component 1303.

The memory 1301 is configured to store a computer program, and may also store various other data to support operations on the terminal device. Examples of such data include instructions of any application or method operated on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.

The memory 1301 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.

The processor 1302 is coupled with the memory 1301 and configured to execute the computer program in the memory 1301 in order to: receive the voice wake-up word through the communication component 1303; send the voice wake-up word to the server through the communication component 1303, so that the server, based on the voice wake-up word, selects from among the ASR models corresponding to different dialects the ASR model corresponding to the first dialect to which the voice wake-up word belongs; and send the voice signal to be recognized to the server through the communication component 1303, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.

The communication component 1303 is configured to receive the voice wake-up word and to send the voice wake-up word and the voice signal to be recognized to the server.
In an optional embodiment, as shown in figure 13, which further includes display screen 1304.Based on this, handle For device 1302 when receiving voice wake-up word, be specifically used for: the instruction in response to being activated or switched on terminal device passes through display screen 1304 show voice input interface to user;And word is waken up based on the voice that voice input interface obtains user's input.
In an optional embodiment, as shown in figure 13, which further includes audio component 1306.Based on this, locate Reason device 1302 is used for: being received voice by audio component 1306 and is waken up word.Correspondingly, processor 1302 to server send to It before recognition of speech signals, is also used to: voice being exported by audio component 1306 and inputs prompt information, to prompt user to carry out language Sound input;And receive the voice signal to be identified of user's input.
In an optional embodiment, processor 1302 is also used to before output voice input prompt information: receiving clothes The notification message that business device returns, notification message are used to indicate the corresponding ASR model of selected first dialect.
In an optional embodiment, processor 1302 is also used to before receiving voice and waking up word: in response to waking up word Self-defining operation receives the customized voice signal of user's input by communication component 1303, and will be on customized voice signal Reach server.
Further, as shown in Figure 13, the terminal device further includes other components, such as a power supply component 1305.

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, implements the steps that can be executed by the terminal device in the above method embodiments.
Figure 14 is a schematic diagram of the module structure of another speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in Figure 14, speech recognition apparatus 1400 includes a first receiving module 1401, a first identification module 1402, a selection module 1403, a second receiving module 1404 and a second identification module 1405.

First receiving module 1401 is configured to receive the voice wake-up word sent by the terminal device.

First identification module 1402 is configured to identify the first dialect to which the voice wake-up word belongs.

Selection module 1403 is configured to select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect.

Second receiving module 1404 is configured to receive the to-be-recognized voice signal sent by the terminal device.

Second identification module 1405 is configured to perform speech recognition, using the ASR model corresponding to the first dialect, on the to-be-recognized voice signal received by second receiving module 1404.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, first identification module 1402 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and benchmark wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the benchmark wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the benchmark text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the benchmark text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
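The first strategy above — dynamic matching of acoustic features against benchmark wake-up words recorded in each dialect — might be sketched with a plain dynamic-time-warping (DTW) distance. This is only an illustrative assumption: the one-dimensional feature sequences and the distance bound standing in for the "first set requirement" are invented placeholders, not the patent's actual acoustic features or threshold.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or diagonal match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def identify_dialect(wake_features, benchmarks, max_distance=5.0):
    """Return the dialect whose benchmark wake word matches best, provided
    the match meets the set requirement (here, a distance bound)."""
    distances = {dialect: dtw_distance(wake_features, feats)
                 for dialect, feats in benchmarks.items()}
    best = min(distances, key=distances.get)
    return best if distances[best] <= max_distance else None


benchmarks = {"cantonese": [1.0, 2.0, 3.0, 2.0],
              "sichuanese": [4.0, 4.0, 1.0, 0.0]}
dialect = identify_dialect([1.1, 2.1, 2.9, 2.0], benchmarks)
```

DTW is a natural fit here because a wake word spoken at different speeds still aligns well; when no benchmark is close enough, the function returns `None`, which would correspond to falling back to one of the other two matching strategies.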
In an optional embodiment, speech recognition apparatus 1400 further includes a construction module configured to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct, according to the acoustic features of the different dialects, the ASR models corresponding to the different dialects.
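The collect → extract → construct sequence of the construction module can be sketched as a toy pipeline. The "feature" here is just per-frame mean absolute amplitude and the "model" a single average value, purely to show the data flow; a real system would extract MFCC or filterbank features and train a full acoustic model per dialect.

```python
def extract_features(samples, frame=4):
    """Crude acoustic feature: mean absolute amplitude per fixed-size frame."""
    return [sum(abs(x) for x in samples[i:i + frame]) / frame
            for i in range(0, len(samples) - frame + 1, frame)]


def build_models(corpora):
    """Map dialect name -> 'model' (here, the mean of all its frame features)."""
    models = {}
    for dialect, utterances in corpora.items():
        feats = [f for utt in utterances for f in extract_features(utt)]
        models[dialect] = sum(feats) / len(feats)
    return models


# Hypothetical per-dialect corpora of raw sample values:
corpora = {"cantonese": [[1, 1, 2, 2, 3, 3, 2, 2]],
           "sichuanese": [[8, 9, 8, 9, 7, 8, 9, 8]]}
models = build_models(corpora)
```

However simplified, the structure mirrors the embodiment: one pass of feature extraction per dialect corpus, then one constructed model per dialect, keyed by dialect name so the later selection step is a dictionary lookup.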
The foregoing describes the internal functions and structure of speech recognition apparatus 1400. As shown in Figure 15, in practice, speech recognition apparatus 1400 may be implemented as a server, including: memory 1501, processor 1502 and communication component 1503.

Memory 1501 is configured to store a computer program, and may further store various other data to support operations on the server. Examples of such data include instructions for any application or method operated on the server, contact data, phone book data, messages, pictures, videos and so on.

Memory 1501 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
Processor 1502 is coupled to memory 1501 and executes the computer program in memory 1501 so as to: receive, through communication component 1503, the voice wake-up word sent by the terminal device; identify the first dialect to which the voice wake-up word belongs; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; receive, through communication component 1503, the to-be-recognized voice signal sent by the terminal device; and perform speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.

Communication component 1503 is configured to receive the voice wake-up word and the to-be-recognized voice signal.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, processor 1502 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and benchmark wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the benchmark wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the benchmark text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the benchmark text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, processor 1502 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct, according to the acoustic features of the different dialects, the ASR models corresponding to the different dialects.
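The server-side sequence in the paragraphs above — receive the wake word, identify its dialect, select the matching ASR model, notify the terminal, then recognize subsequent speech — might be organized as below. `AsrServer` is a hypothetical name; `identify` stands in for any of the three matching strategies, and the notification dictionary is an assumed message format, not one specified by the patent.

```python
class AsrServer:
    """Per-dialect model registry plus the wake-word-driven selection step."""

    def __init__(self, models, identify):
        self.models = models      # dialect name -> recognizer callable
        self.identify = identify  # wake-word audio -> dialect name
        self.current = None

    def on_wake_word(self, wake_word):
        dialect = self.identify(wake_word)
        if dialect not in self.models:
            raise LookupError(f"no ASR model for dialect: {dialect}")
        self.current = self.models[dialect]
        # Corresponds to the notification message returned to the terminal.
        return {"notification": "model-selected", "dialect": dialect}

    def on_speech(self, signal):
        assert self.current is not None, "no model selected yet"
        return self.current(signal)


server = AsrServer({"cantonese": lambda s: f"text({s})"},
                   identify=lambda wake_word: "cantonese")
msg = server.on_wake_word(b"wake-word-audio")
```

Keeping the selected model as per-session state is one plausible design; it lets every subsequent `on_speech` call reuse the dialect decision made once from the short wake word.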
Further, as shown in Figure 15, the server further includes an audio component 1506. On this basis, processor 1502 is configured to: receive, through audio component 1506, the voice wake-up word sent by the terminal device, and receive, through audio component 1506, the to-be-recognized voice signal sent by the terminal device.

Further, as shown in Figure 15, the server further includes other components, such as a display screen 1504 and a power supply component 1505.

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, implements the steps that can be executed by the server in the above method embodiments.
In this embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects; the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.

Further, because the voice wake-up word is relatively short, identifying the dialect to which it belongs takes relatively little time. This enables the speech recognition system to quickly identify the first dialect to which the voice wake-up word belongs and to select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
Figure 16 is a schematic diagram of the module structure of yet another speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in Figure 16, speech recognition apparatus 1600 includes a receiving module 1601, a first identification module 1602, a selection module 1603 and a second identification module 1604.

Receiving module 1601 is configured to receive the voice wake-up word.

First identification module 1602 is configured to identify the first dialect to which the voice wake-up word belongs.

Selection module 1603 is configured to select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect.

Second identification module 1604 is configured to perform speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, first identification module 1602 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and benchmark wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the benchmark wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the benchmark text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the benchmark text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In an optional embodiment, when receiving the voice wake-up word sent by the terminal device, receiving module 1601 is specifically configured to: in response to an instruction to start or power on the terminal device, display a voice input interface to the user; and obtain, via the voice input interface, the voice wake-up word input by the user.

In an optional embodiment, before performing speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect, second identification module 1604 is further configured to: output a voice input prompt, to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
In an optional embodiment, speech recognition apparatus 1600 further includes a construction module configured to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct, according to the acoustic features of the different dialects, the ASR models corresponding to the different dialects.

In an optional embodiment, before receiving the voice wake-up word, receiving module 1601 is further configured to: in response to a wake-up word customization operation, receive a user-defined voice signal input by the user; and save the user-defined voice signal as the voice wake-up word.
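The customization path above — save the user's recording, then use it as the wake word — can be sketched as a tiny store. Everything here is an assumption for illustration: the signal is raw bytes and the comparison is byte equality, whereas a real implementation would compare acoustic features of the recordings.

```python
class WakeWordStore:
    """Holds the current wake-word recording; starts with a factory default."""

    def __init__(self, default=b"default-wake"):
        self._wake = default

    def customize(self, user_signal: bytes):
        """Save the user-defined voice signal as the new wake word."""
        self._wake = user_signal

    def is_wake_word(self, signal: bytes) -> bool:
        # Placeholder check: real matching would be acoustic, not byte-exact.
        return signal == self._wake


store = WakeWordStore()
store.customize(b"ni-hao-xiao-bao")
```

After customization the default recording no longer wakes the device; only the user-defined signal does, which is the behavior the embodiment describes.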
The foregoing describes the internal functions and structure of speech recognition apparatus 1600. As shown in Figure 17, in practice, speech recognition apparatus 1600 may be implemented as an electronic device, including: memory 1701, processor 1702 and communication component 1703. The electronic device may be a terminal device, or may be a server.

Memory 1701 is configured to store a computer program, and may further store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operated on the electronic device, contact data, phone book data, messages, pictures, videos and so on.

Memory 1701 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
Processor 1702 is coupled to memory 1701 and executes the computer program in memory 1701 so as to: receive a voice wake-up word through communication component 1703; identify the first dialect to which the voice wake-up word belongs; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and perform speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.

Communication component 1703 is configured to receive the voice wake-up word.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, processor 1702 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and benchmark wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the benchmark wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the benchmark text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the benchmark text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In an optional embodiment, as shown in Figure 17, the electronic device further includes a display screen 1704. On this basis, when receiving the voice wake-up word sent by the terminal device, processor 1702 is specifically configured to: in response to an instruction to start or power on the terminal device, display a voice input interface to the user through display screen 1704; and obtain, via the voice input interface, the voice wake-up word input by the user.

In an optional embodiment, as shown in Figure 17, the electronic device further includes an audio component 1706. On this basis, before performing speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect, processor 1702 is further configured to: output a voice input prompt through audio component 1706, to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user. Correspondingly, processor 1702 is further configured to receive the voice wake-up word through audio component 1706.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, processor 1702 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct, according to the acoustic features of the different dialects, the ASR models corresponding to the different dialects.

In an optional embodiment, before receiving the voice wake-up word, processor 1702 is further configured to: in response to a wake-up word customization operation, receive, through communication component 1703, a user-defined voice signal input by the user; and save the user-defined voice signal as the voice wake-up word. Further, as shown in Figure 17, the electronic device further includes other components, such as a power supply component 1705.

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, implements the steps that can be executed by the electronic device in the above method embodiments.
In this embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects; the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.

Further, because the voice wake-up word is relatively short, identifying the dialect to which it belongs takes relatively little time. This enables the speech recognition system to quickly identify the first dialect to which the voice wake-up word belongs and to select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
An embodiment of the present application further provides a terminal device, including: a memory, a processor and a communication component.

The memory is configured to store a computer program, and may further store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operated on the terminal device, contact data, phone book data, messages, pictures, videos and so on.

The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The processor is coupled to the memory and the communication component, and executes the computer program in the memory so as to: receive a voice wake-up word through the communication component, to wake up the speech recognition function; receive, through the communication component, a first voice signal input by the user that indicates a dialect; parse, from the first voice signal, the first dialect in which speech recognition needs to be performed; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; send a service request to the server through the communication component, to request the server to select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and send a to-be-recognized voice signal to the server through the communication component, so that the server performs speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.

The communication component is configured to receive the voice wake-up word and the first voice signal, and to send the service request and the to-be-recognized voice signal to the server.
In an optional embodiment, before sending the service request to the server, the processor is further configured to: if the first dialect cannot be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
In an optional embodiment, the memory is further configured to store the phoneme segments corresponding to different dialect names. Correspondingly, when parsing from the first voice signal the first dialect in which speech recognition needs to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
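The phoneme-matching step described above might be sketched as follows: the acoustic model that converts the first voice signal into a phoneme sequence is stubbed out, and stored phoneme segments for each dialect name are searched for inside that sequence. The phoneme strings and dialect entries are illustrative assumptions, not a real phone set.

```python
# Hypothetical per-dialect-name phoneme segments, as stored in the memory:
DIALECT_PHONEMES = {
    "cantonese": ["g", "w", "o", "ng"],
    "sichuanese": ["s", "i", "ch", "uan"],
}


def contains_segment(sequence, segment):
    """True if `segment` occurs contiguously inside `sequence`."""
    n = len(segment)
    return any(sequence[i:i + n] == segment for i in range(len(sequence) - n + 1))


def parse_dialect(phoneme_sequence):
    """Return the first dialect whose name segment appears, else None."""
    for dialect, segment in DIALECT_PHONEMES.items():
        if contains_segment(phoneme_sequence, segment):
            return dialect
    return None


# A stand-in for "acoustic model output" of a signal mentioning Cantonese:
seq = ["y", "u", "ng", "g", "w", "o", "ng", "w", "a"]
```

Returning `None` when no segment matches corresponds to the fallback embodiment above, where the dialect of the voice wake-up word is used instead.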
An embodiment of the present application further provides a server, including: a memory, a processor and a communication component.

The memory is configured to store a computer program, and may further store various other data to support operations on the server. Examples of such data include instructions for any application or method operated on the server, contact data, phone book data, messages, pictures, videos and so on.

The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The processor is coupled to the memory and the communication component, and executes the computer program in the memory so as to: receive, through the communication component, the voice wake-up word sent by the terminal device, to wake up the speech recognition function; receive, through the communication component, the first voice signal indicating a dialect sent by the terminal device; parse, from the first voice signal, the first dialect in which speech recognition needs to be performed; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; receive, through the communication component, the to-be-recognized voice signal sent by the terminal device; and perform speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.

The communication component is configured to receive the voice wake-up word, the first voice signal and the to-be-recognized voice signal.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect cannot be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.

In an optional embodiment, the memory is further configured to store the phoneme segments corresponding to different dialect names. Correspondingly, when parsing from the first voice signal the first dialect in which speech recognition needs to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
An embodiment of the present application further provides an electronic device, which may be a terminal device or a server. The electronic device includes: a memory, a processor and a communication component.

The memory is configured to store a computer program, and may further store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operated on the electronic device, contact data, phone book data, messages, pictures, videos and so on.

The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
The processor is coupled to the memory and the communication component, and executes the computer program in the memory so as to: receive a voice wake-up word through the communication component, to wake up the speech recognition function; receive, through the communication component, a first voice signal input by the user that indicates a dialect; parse, from the first voice signal, the first dialect in which speech recognition needs to be performed; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and perform speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.

The communication component is configured to receive the voice wake-up word and the first voice signal.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect cannot be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.

In an optional embodiment, the memory is further configured to store the phoneme segments corresponding to different dialect names. Correspondingly, when parsing from the first voice signal the first dialect in which speech recognition needs to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
The communication components in Figures 9, 11, 13, 15 and 17 above are configured to facilitate wired or wireless communication between the device in which the communication component resides and other devices. The device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component receives, via a broadcast channel, a broadcast signal or broadcast-related information from an external broadcast management system. In an exemplary embodiment, the communication component further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
The display screens in Figures 9, 11, 13, 15 and 17 above include a liquid crystal display (LCD) and a touch panel (TP). If the display screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply components in Figures 9, 11, 13, 15 and 17 above provide electric power for the various components of the device in which the power supply component resides. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing electric power for the device.

The audio components in Figures 9, 11, 13, 15 and 17 above are configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC); when the device in which the audio component resides is in an operating mode, such as a call mode, a recording mode or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory or sent via the communication component. In some embodiments, the audio component further includes a loudspeaker for outputting audio signals.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. Therefore, the present invention may take the form of a full hardware embodiment, a full software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, a network interface and memory.

The memory may include a non-persistent memory in a computer-readable medium, a random access memory (RAM) and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes that element.
The above descriptions are merely embodiments of the present application and are not intended to limit the present application. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (24)

1. A speech recognition method, applicable to a terminal device, characterized in that the method comprises:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
sending a service request to a server to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects;
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
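Outside the claim language, the claim-1 flow can be sketched in Python with the server simulated in-process. The model names, the acknowledgement string, and the class/function names are all illustrative assumptions, not part of the claim.

```python
# Minimal sketch of the claim-1 flow: the terminal identifies the wake
# word's dialect, the (simulated) server selects the matching ASR model,
# and the terminal then sends the speech signal for recognition.

# Stand-in ASR models: dialect -> recognizer function (hypothetical).
ASR_MODELS = {
    "mandarin": lambda audio: f"[mandarin ASR] {audio}",
    "cantonese": lambda audio: f"[cantonese ASR] {audio}",
}

class Server:
    """Simulated server holding ASR models for different dialects."""

    def __init__(self):
        self.selected_model = None

    def handle_service_request(self, first_dialect):
        # Select the ASR model corresponding to the requested dialect.
        self.selected_model = ASR_MODELS[first_dialect]
        return "model-selected"  # notification message (cf. claim 5)

    def recognize(self, audio):
        # Perform recognition with the previously selected model.
        return self.selected_model(audio)

def terminal_flow(first_dialect, audio, server):
    # Terminal side: request model selection, then send the audio.
    ack = server.handle_service_request(first_dialect)
    assert ack == "model-selected"
    return server.recognize(audio)

server = Server()
print(terminal_flow("cantonese", "nei hou", server))
# -> [cantonese ASR] nei hou
```

Simulating the server as a local object keeps the sketch self-contained; in the claimed system the service request and the audio would travel over a network via the communication component.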
2. The method according to claim 1, characterized in that identifying the first dialect to which the voice wake-up word belongs comprises:
performing dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and taking the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement as the first dialect; or
matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking the dialect whose acoustic-feature matching degree with the voice wake-up word meets a second set requirement as the first dialect; or
converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement as the first dialect.
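One of the claim-2 branches (matching against per-dialect reference wake-up words, with a set threshold as the "set requirement") can be illustrated as follows. String similarity via `difflib` stands in for the acoustic dynamic matching the claim describes; the reference wake-up words and the threshold value are made-up assumptions.

```python
import difflib

# Hypothetical reference wake-up words, one per dialect.
REFERENCE_WAKE_WORDS = {
    "mandarin": "ni hao xiao bao",
    "cantonese": "nei hou siu bou",
}

def identify_first_dialect(wake_word, threshold=0.6):
    """Return the dialect whose reference wake-up word best matches the
    received wake-up word, provided the matching degree meets the set
    requirement (here: a similarity threshold)."""
    best_dialect, best_score = None, 0.0
    for dialect, reference in REFERENCE_WAKE_WORDS.items():
        # difflib's ratio stands in for acoustic dynamic matching (e.g. DTW).
        score = difflib.SequenceMatcher(None, wake_word, reference).ratio()
        if score > best_score:
            best_dialect, best_score = dialect, score
    return best_dialect if best_score >= threshold else None

print(identify_first_dialect("nei hou siu bou"))  # -> cantonese
```

A real implementation would compare acoustic feature sequences (e.g. MFCCs) rather than transcribed strings, but the select-best-above-threshold structure is the same.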
3. The method according to claim 1, characterized in that receiving the voice wake-up word comprises:
in response to an instruction to activate or switch on the terminal device, displaying a voice input interface to a user;
acquiring, via the voice input interface, the voice wake-up word input by the user.
4. The method according to any one of claims 1-3, characterized in that before sending the speech signal to be recognized to the server, the method further comprises:
outputting voice input prompt information to prompt the user to perform voice input;
receiving the speech signal to be recognized input by the user.
5. The method according to claim 4, characterized in that before outputting the voice input prompt information, the method further comprises:
receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
6. The method according to any one of claims 1-3, characterized in that before receiving the voice wake-up word, the method further comprises:
in response to a wake-up word customization operation, receiving a custom voice signal input by the user;
saving the custom voice signal as the voice wake-up word.
7. A speech recognition method, applicable to a server, characterized in that the method comprises:
receiving a service request sent by a terminal device, the service request indicating selection of an ASR model corresponding to a first dialect;
selecting, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs;
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
8. The method according to claim 7, characterized in that before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further comprises:
collecting corpora of different dialects;
performing feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects;
constructing ASR models corresponding to the different dialects according to the acoustic features of the different dialects.
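The claim-8 pipeline (collect corpora, extract features, build one model per dialect) can be sketched with toy stand-ins. Character-bigram counts substitute for acoustic features, and a feature table substitutes for a trained ASR model; all names and data are illustrative.

```python
from collections import Counter

def extract_features(utterances):
    """Toy 'acoustic feature' extraction: character-bigram counts stand in
    for MFCC-style features computed from real dialect recordings."""
    features = Counter()
    for utterance in utterances:
        features.update(utterance[i:i + 2] for i in range(len(utterance) - 1))
    return features

def build_dialect_models(corpora):
    """corpora maps dialect -> list of collected utterances.
    Returns one (toy) per-dialect model built from its extracted features."""
    return {dialect: extract_features(utts) for dialect, utts in corpora.items()}

# Hypothetical collected corpora.
corpora = {
    "mandarin": ["ni hao", "xie xie"],
    "cantonese": ["nei hou", "m goi"],
}
models = build_dialect_models(corpora)
print(sorted(models))  # -> ['cantonese', 'mandarin']
```

In a production system each per-dialect model would be an acoustic/language model trained on the dialect's corpus; the point here is only the one-model-per-dialect structure that the server later selects from.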
9. A speech recognition method, applicable to a terminal device, characterized in that the method comprises:
receiving a voice wake-up word;
sending the voice wake-up word to a server, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from ASR models corresponding to different dialects;
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
10. A speech recognition method, applicable to a server, characterized in that the method comprises:
receiving a voice wake-up word sent by a terminal device;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
11. A speech recognition method, characterized by comprising:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
selecting an ASR model corresponding to the first dialect from ASR models corresponding to different dialects;
performing speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect.
12. A speech recognition method, applicable to a terminal device, characterized in that the method comprises:
receiving a voice wake-up word to wake up a speech recognition function;
receiving a first voice signal with dialect-indicating significance input by a user;
parsing, from the first voice signal, a first dialect for which speech recognition is needed;
sending a service request to a server to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects;
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
13. The method according to claim 12, characterized in that before sending the service request to the server, the method further comprises:
if the first dialect fails to be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs as the first dialect.
14. The method according to claim 12 or 13, characterized in that parsing, from the first voice signal, the first dialect for which speech recognition is needed comprises:
converting the first voice signal into a first phoneme sequence based on an acoustic model;
matching phoneme segments corresponding to different dialect names against the first phoneme sequence;
when a matching phoneme segment is found in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
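The claim-14 parsing step can be sketched as a contiguous-subsequence search over a phoneme sequence. The phoneme inventories for the dialect names are invented for illustration; a real system would obtain the sequence from an acoustic model.

```python
def contains_segment(sequence, segment):
    """True if `segment` occurs contiguously inside `sequence`."""
    m = len(segment)
    return any(sequence[i:i + m] == segment for i in range(len(sequence) - m + 1))

# Hypothetical phoneme segments for spoken dialect names.
DIALECT_NAME_SEGMENTS = {
    "sichuanese": ["s", "i", "ch", "uan", "h", "ua"],
    "cantonese": ["g", "uang", "d", "ong", "h", "ua"],
}

def parse_first_dialect(phoneme_sequence):
    """Scan the phoneme sequence for a dialect-name segment (claim 14).
    Returns None when no segment matches, in which case the claim-13
    fallback (use the wake word's dialect) would apply."""
    for dialect, segment in DIALECT_NAME_SEGMENTS.items():
        if contains_segment(phoneme_sequence, segment):
            return dialect
    return None

# Phoneme sequence for an utterance like "wo shuo guangdong hua".
utterance = ["w", "o", "sh", "uo", "g", "uang", "d", "ong", "h", "ua"]
print(parse_first_dialect(utterance))  # -> cantonese
```

Matching on phoneme segments rather than recognized text lets the terminal detect the dialect name before any dialect-specific ASR model has been selected.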
15. A terminal device, characterized by comprising: a memory, a processor, and a communication component;
the memory being configured to store a computer program;
the processor being coupled to the memory and configured to execute the computer program so as to:
receive a voice wake-up word via the communication component;
identify a first dialect to which the voice wake-up word belongs;
send a service request to a server via the communication component to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects;
send a speech signal to be recognized to the server via the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component being configured to receive the voice wake-up word, and to send the service request and the speech signal to be recognized to the server.
16. A server, characterized by comprising: a memory, a processor, and a communication component;
the memory being configured to store a computer program;
the processor being coupled to the memory and configured to execute the computer program so as to:
receive, via the communication component, a service request sent by a terminal device, the service request indicating selection of an ASR model corresponding to a first dialect;
select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs;
receive, via the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component being configured to receive the service request and the speech signal to be recognized.
17. A terminal device, characterized by comprising: a memory, a processor, and a communication component;
the memory being configured to store a computer program;
the processor being coupled to the memory and configured to execute the computer program so as to:
receive a voice wake-up word via the communication component;
send the voice wake-up word to a server via the communication component, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from ASR models corresponding to different dialects;
send a speech signal to be recognized to the server via the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component being configured to receive the voice wake-up word, and to send the voice wake-up word and the speech signal to be recognized to the server.
18. A server, characterized by comprising: a memory, a processor, and a communication component;
the memory being configured to store a computer program;
the processor being coupled to the memory and configured to execute the computer program so as to:
receive, via the communication component, a voice wake-up word sent by a terminal device;
identify a first dialect to which the voice wake-up word belongs;
select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
receive, via the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component being configured to receive the voice wake-up word and the speech signal to be recognized.
19. An electronic device, characterized by comprising: a memory, a processor, and a communication component;
the memory being configured to store a computer program;
the processor being coupled to the memory and configured to execute the computer program so as to:
receive a voice wake-up word via the communication component;
identify a first dialect to which the voice wake-up word belongs;
select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects;
perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component being configured to receive the voice wake-up word.
20. A terminal device, characterized by comprising: a memory, a processor, and a communication component;
the memory being configured to store a computer program;
the processor being coupled to the memory and configured to execute the computer program so as to:
receive a voice wake-up word via the communication component, to wake up a speech recognition function;
receive, via the communication component, a first voice signal with dialect-indicating significance input by a user;
parse, from the first voice signal, a first dialect for which speech recognition is needed;
send a service request to a server via the communication component to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects;
send a speech signal to be recognized to the server via the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component being configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
21. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a computer, implements the steps of the method according to any one of claims 1-6.
22. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a computer, implements the steps of the method according to any one of claims 7-8.
23. A speech recognition system, characterized by comprising a server and a terminal device;
the terminal device being configured to receive a voice wake-up word, identify a first dialect to which the voice wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, the service request indicating selection of the ASR model corresponding to the first dialect;
the server being configured to receive the service request, select, according to the indication of the service request, the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
24. A speech recognition system, characterized by comprising a server and a terminal device;
the terminal device being configured to receive a voice wake-up word, send the voice wake-up word to the server, and send a speech signal to be recognized to the server;
the server being configured to receive the voice wake-up word, identify a first dialect to which the voice wake-up word belongs, select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
CN201711147698.XA 2017-11-17 2017-11-17 Audio recognition method, apparatus and system Pending CN109817220A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201711147698.XA CN109817220A (en) 2017-11-17 2017-11-17 Audio recognition method, apparatus and system
TW107132609A TW201923736A (en) 2017-11-17 2018-09-17 Speech recognition method, device and system
PCT/CN2018/114531 WO2019096056A1 (en) 2017-11-17 2018-11-08 Speech recognition method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711147698.XA CN109817220A (en) 2017-11-17 2017-11-17 Audio recognition method, apparatus and system

Publications (1)

Publication Number Publication Date
CN109817220A true CN109817220A (en) 2019-05-28

Family

ID=66539363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711147698.XA Pending CN109817220A (en) 2017-11-17 2017-11-17 Audio recognition method, apparatus and system

Country Status (3)

Country Link
CN (1) CN109817220A (en)
TW (1) TW201923736A (en)
WO (1) WO2019096056A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110364147A (en) * 2019-08-29 2019-10-22 厦门市思芯微科技有限公司 A kind of wake-up training word acquisition system and method
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN110853643A (en) * 2019-11-18 2020-02-28 北京小米移动软件有限公司 Method, device, equipment and storage medium for voice recognition in fast application
CN111081217A (en) * 2019-12-03 2020-04-28 珠海格力电器股份有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111091809A (en) * 2019-10-31 2020-05-01 国家计算机网络与信息安全管理中心 Regional accent recognition method and device based on depth feature fusion
CN111128125A (en) * 2019-12-30 2020-05-08 深圳市优必选科技股份有限公司 Voice service configuration system and voice service configuration method and device thereof
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN112102819A (en) * 2019-05-29 2020-12-18 南宁富桂精密工业有限公司 Voice recognition device and method for switching recognition languages thereof
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN112820296A (en) * 2021-01-06 2021-05-18 北京声智科技有限公司 Data transmission method and electronic equipment
CN113506565A (en) * 2021-07-12 2021-10-15 北京捷通华声科技股份有限公司 Speech recognition method, speech recognition device, computer-readable storage medium and processor

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN104575504A (en) * 2014-12-24 2015-04-29 上海师范大学 Method for personalized television voice wake-up by voiceprint and voice identification
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN106653031A (en) * 2016-10-17 2017-05-10 海信集团有限公司 Voice wake-up method and voice interaction device
CN106997762A (en) * 2017-03-08 2017-08-01 广东美的制冷设备有限公司 The sound control method and device of household electrical appliance
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9431012B2 (en) * 2012-04-30 2016-08-30 2236008 Ontario Inc. Post processing of natural language automatic speech recognition
CN105223851A (en) * 2015-10-09 2016-01-06 韩山师范学院 Based on intelligent socket system and the control method of accent recognition
CN105957527A (en) * 2016-05-16 2016-09-21 珠海格力电器股份有限公司 Method and device for voice control of electric appliance and voice control air conditioner
CN106452997A (en) * 2016-09-30 2017-02-22 无锡小天鹅股份有限公司 Household electrical appliance and control system thereof


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102819A (en) * 2019-05-29 2020-12-18 南宁富桂精密工业有限公司 Voice recognition device and method for switching recognition languages thereof
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN110364147A (en) * 2019-08-29 2019-10-22 厦门市思芯微科技有限公司 A kind of wake-up training word acquisition system and method
CN111091809A (en) * 2019-10-31 2020-05-01 国家计算机网络与信息安全管理中心 Regional accent recognition method and device based on depth feature fusion
CN110853643A (en) * 2019-11-18 2020-02-28 北京小米移动软件有限公司 Method, device, equipment and storage medium for voice recognition in fast application
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111081217B (en) * 2019-12-03 2021-06-04 珠海格力电器股份有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111081217A (en) * 2019-12-03 2020-04-28 珠海格力电器股份有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111128125A (en) * 2019-12-30 2020-05-08 深圳市优必选科技股份有限公司 Voice service configuration system and voice service configuration method and device thereof
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111724766B (en) * 2020-06-29 2024-01-05 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN112820296A (en) * 2021-01-06 2021-05-18 北京声智科技有限公司 Data transmission method and electronic equipment
CN113506565A (en) * 2021-07-12 2021-10-15 北京捷通华声科技股份有限公司 Speech recognition method, speech recognition device, computer-readable storage medium and processor
CN113506565B (en) * 2021-07-12 2024-06-04 北京捷通华声科技股份有限公司 Speech recognition method, device, computer readable storage medium and processor

Also Published As

Publication number Publication date
TW201923736A (en) 2019-06-16
WO2019096056A1 (en) 2019-05-23

Similar Documents

Publication Publication Date Title
CN109817220A (en) Audio recognition method, apparatus and system
CN103366740B (en) Voice command identification method and device
US10977299B2 (en) Systems and methods for consolidating recorded content
CN103095911B (en) Method and system for finding mobile phone through voice awakening
CN102568478B (en) Video play control method and system based on voice recognition
US10719115B2 (en) Isolated word training and detection using generated phoneme concatenation models of audio inputs
US20170140750A1 (en) Method and device for speech recognition
CN110706690A (en) Speech recognition method and device
WO2017012511A1 (en) Voice control method and device, and projector apparatus
CN110473546B (en) Media file recommendation method and device
US20110093261A1 (en) System and method for voice recognition
CN106971723A (en) Method of speech processing and device, the device for speech processes
CN100521708C (en) Voice recognition and voice tag recoding and regulating method of mobile information terminal
US20160372110A1 (en) Adapting voice input processing based on voice input characteristics
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
CN109994106B (en) Voice processing method and equipment
CN101794576A (en) Dirty word detection aid and using method thereof
CN110634466B (en) TTS treatment technology with high infectivity
US10699706B1 (en) Systems and methods for device communications
CN105206123B (en) A kind of deaf and dumb patient's ac equipment
US20190066669A1 (en) Graphical data selection and presentation of digital content
CN109272991A (en) Method, apparatus, equipment and the computer readable storage medium of interactive voice
CN101825953A (en) Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190528

RJ01 Rejection of invention patent application after publication