CN109817220A - Audio recognition method, apparatus and system - Google Patents
- Publication number: CN109817220A (application CN201711147698.XA)
- Authority: CN (China)
- Prior art keywords: dialect, voice, word, wakes, server
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Abstract
Embodiments of the present application provide an audio recognition method, apparatus, and system. The method includes: receiving a voice wake-up word; identifying a first dialect to which the voice wake-up word belongs; sending a service request to a server, to request that the server select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect. The method provided by this embodiment can perform speech recognition on multiple dialects automatically, improving the efficiency of multi-dialect speech recognition.
Description
Technical field
The present application relates to the technical field of speech recognition, and in particular to an audio recognition method, apparatus, and system.
Background
Automatic speech recognition (Automatic Speech Recognition, ASR) is a technology that converts human speech audio signals into text content. With the development of software and hardware technology, the computing and storage capabilities of various smart devices have improved greatly, so that speech recognition technology is widely applied in smart devices.
Speech recognition technology must accurately identify phonemes of speech before the identified phonemes can be converted into text. However, whatever the language, various factors give rise to many different pronunciations of it, i.e., dialects. Taking Chinese as an example, there are dialect groups such as Mandarin, Jin, Xiang, Gan, Wu, Min, Yue, and Hakka, and the pronunciations of different dialects differ greatly.
At present, speech recognition schemes for dialects are still immature, and a solution to the multi-dialect problem needs to be provided.
Summary of the invention
Various aspects of the present application provide an audio recognition method, apparatus, and system, so as to perform speech recognition on multiple dialects automatically and improve the efficiency of multi-dialect speech recognition.
An embodiment of the present application provides an audio recognition method, applicable to a terminal device, the method comprising:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
sending a service request to a server, to request that the server select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application also provides an audio recognition method, applicable to a server, the method comprising:
receiving a service request sent by a terminal device, the service request indicating selection of the ASR model corresponding to a first dialect;
selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs; and
receiving the speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application also provides an audio recognition method, applicable to a terminal device, the method comprising:
receiving a voice wake-up word;
sending the voice wake-up word to a server, so that the server, based on the voice wake-up word, selects from among ASR models corresponding to different dialects the ASR model corresponding to a first dialect to which the voice wake-up word belongs; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application also provides an audio recognition method, applicable to a server, the method comprising:
receiving a voice wake-up word sent by a terminal device;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receiving the speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application also provides an audio recognition method, comprising:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
performing speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect.
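The combined method above (identify the wake-up word's dialect, select the matching model, recognize the following speech) can be sketched as follows. This is a toy illustration under assumed names: the dialect labels, the `AsrModel` class, and the classifier stub are stand-ins, not part of the disclosure.

```python
class AsrModel:
    """Stand-in for an ASR model trained for one dialect."""

    def __init__(self, dialect):
        self.dialect = dialect

    def transcribe(self, audio):
        # A real model would decode the audio into text; here we only
        # tag the output with the dialect whose model was used.
        return f"[{self.dialect}] transcript ({len(audio)} samples)"


# One ASR model per dialect (a real system would load trained models).
ASR_MODELS = {d: AsrModel(d) for d in ("Mandarin", "Cantonese", "Hokkien")}


def identify_wake_word_dialect(wake_word_audio):
    # Placeholder for a dialect classifier run on the wake-word audio.
    return "Cantonese"


def recognize(wake_word_audio, speech_audio):
    dialect = identify_wake_word_dialect(wake_word_audio)  # the first dialect
    model = ASR_MODELS[dialect]                            # select its model
    return model.transcribe(speech_audio)                  # recognize speech
```

The point of the sketch is only the control flow: the wake-up word determines which model every subsequent signal is decoded with.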
An embodiment of the present application also provides an audio recognition method, applicable to a terminal device, the method comprising:
receiving a voice wake-up word, to wake up the speech recognition function;
receiving a first voice signal input by a user that indicates a dialect;
parsing, from the first voice signal, a first dialect for which speech recognition is to be performed;
sending a service request to a server, to request that the server select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application also provides a terminal device, comprising a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
send a service request to a server through the communication component, to request that the server select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application also provides a server, comprising a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled with the memory, is configured to execute the computer program so as to:
receive, through the communication component, a service request sent by a terminal device, the service request indicating selection of the ASR model corresponding to a first dialect;
select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs; and
receive, through the communication component, the speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the service request and the speech signal to be recognized.
An embodiment of the present application also provides a terminal device, comprising a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component;
send the voice wake-up word to a server through the communication component, so that the server, based on the voice wake-up word, selects from among ASR models corresponding to different dialects the ASR model corresponding to a first dialect to which the voice wake-up word belongs; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word, and to send the voice wake-up word and the speech signal to be recognized to the server.
An embodiment of the present application also provides a server, comprising a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled with the memory, is configured to execute the computer program so as to:
receive, through the communication component, a voice wake-up word sent by a terminal device;
identify a first dialect to which the voice wake-up word belongs;
select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receive, through the communication component, the speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word and the speech signal to be recognized.
An embodiment of the present application also provides an electronic device, comprising a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word.
An embodiment of the present application also provides a terminal device, comprising a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled with the memory, is configured to execute the computer program so as to:
receive a voice wake-up word through the communication component, to wake up the speech recognition function;
receive, through the communication component, a first voice signal input by a user that indicates a dialect;
parse, from the first voice signal, a first dialect for which speech recognition is to be performed;
send a service request to a server through the communication component, to request that the server select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the first audio recognition method embodiment described above.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the second audio recognition method embodiment described above.
An embodiment of the present application also provides a speech recognition system, comprising a server and a terminal device;
the terminal device is configured to receive a voice wake-up word, identify a first dialect to which the voice wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, the service request indicating selection of the ASR model corresponding to the first dialect;
the server is configured to receive the service request, select, according to the indication of the service request, the ASR model corresponding to the first dialect from among ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application also provides a speech recognition system, comprising a server and a terminal device;
the terminal device is configured to receive a voice wake-up word, send the voice wake-up word to the server, and send a speech signal to be recognized to the server;
the server is configured to receive the voice wake-up word, identify a first dialect to which the voice wake-up word belongs, select the ASR model corresponding to the first dialect from among ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
In embodiments of the present application, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, and the ASR model corresponding to that dialect is then selected from among the ASR models corresponding to different dialects; subsequent speech signals to be recognized are recognized with the selected ASR model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Brief description of the drawings
The drawings described herein are provided for a further understanding of the present application and constitute a part of this application. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic structural diagram of a speech recognition system provided by an exemplary embodiment of the application;
Fig. 2 is a schematic flowchart of an audio recognition method provided by another exemplary embodiment of the application;
Fig. 3 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the application;
Fig. 4 is a schematic structural diagram of another speech recognition system provided by another exemplary embodiment of the application;
Fig. 5 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the application;
Fig. 6 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the application;
Fig. 7 is a schematic flowchart of another audio recognition method provided by another exemplary embodiment of the application;
Fig. 8 is a schematic modular diagram of a speech recognition apparatus provided by another exemplary embodiment of the application;
Fig. 9 is a schematic structural diagram of a terminal device provided by another exemplary embodiment of the application;
Fig. 10 is a schematic modular diagram of another speech recognition apparatus provided by another exemplary embodiment of the application;
Fig. 11 is a schematic structural diagram of a server provided by another exemplary embodiment of the application;
Fig. 12 is a schematic modular diagram of another speech recognition apparatus provided by another exemplary embodiment of the application;
Fig. 13 is a schematic structural diagram of another terminal device provided by another exemplary embodiment of the application;
Fig. 14 is a schematic modular diagram of another speech recognition apparatus provided by another exemplary embodiment of the application;
Fig. 15 is a schematic structural diagram of another server provided by another exemplary embodiment of the application;
Fig. 16 is a schematic modular diagram of another speech recognition apparatus provided by another exemplary embodiment of the application;
Fig. 17 is a schematic structural diagram of an electronic device provided by another exemplary embodiment of the application.
Specific embodiments
To make the purposes, technical solutions, and advantages of the application clearer, the technical solutions of the application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. Based on the embodiments in the application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
In the prior art, speech recognition schemes for dialects are still immature. For this technical problem, embodiments of the application provide a solution whose main idea is: construct an ASR model for each of the different dialects; during speech recognition, first identify the dialect to which the voice wake-up word belongs, then select, from among the ASR models corresponding to different dialects, the ASR model corresponding to that dialect, and use the selected ASR model to perform speech recognition on subsequent speech signals to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
The technical solutions provided by the various embodiments of the application are described in detail below with reference to the drawings.
Fig. 1 is a schematic structural diagram of a speech recognition system provided by an exemplary embodiment of the application. As shown in Fig. 1, the speech recognition system includes a server 101 and a terminal device 102, which are communicatively connected.
For example, the terminal device 102 may be communicatively connected to the server 101 via the Internet, or via a mobile network. If the terminal device 102 is connected to the server 101 via a mobile network, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, and the like.
The server 101 mainly provides ASR models for different dialects and selects the corresponding ASR model to perform speech recognition on speech signals of the corresponding dialect. The server 101 may be any device that can provide computing services, respond to service requests, and process them, for example a general-purpose server, a cloud server, a cloud host, or a virtual center. The composition of the server mainly includes a processor, a hard disk, memory, a system bus, and so on, similar to a general computer architecture.
In this embodiment, the terminal device 102 mainly faces the user and can provide the user with an interface or entry point for speech recognition. The terminal device 102 can be implemented in many ways, for example as a smartphone, a smart speaker, a personal computer, a wearable device, or a tablet computer. The terminal device 102 generally includes at least one processing unit and at least one memory. The numbers of processing units and memories depend on the configuration and type of the terminal device 102. The memory may include volatile memory such as RAM, non-volatile memory such as read-only memory (Read-Only Memory, ROM) or flash memory, or both types at once. The memory typically stores an operating system (Operating System, OS) and one or more application programs, and may also store program data and the like. In addition to the processing unit and memory, the terminal device 102 also includes some basic components, such as a network-card chip, an IO bus, and audio/video components (such as a microphone). Optionally, the terminal device 102 may also include some peripheral devices, such as a keyboard, mouse, stylus, or printer. These peripheral devices are well known in the art and will not be described again here.
In this embodiment, the terminal device 102 and the server 101 cooperate to provide a speech recognition function to the user. Furthermore, considering that in some cases the terminal device 102 may be used by multiple users, those users may speak different dialects. Taking Chinese as an example, by region the dialects include the following groups: Mandarin, Jin, Xiang, Gan, Wu, Min, Yue, and Hakka. Further, some dialects can be subdivided; for example, Min may include Minbei (Northern Min), Minnan (Southern Min), Mindong (Eastern Min), Minzhong (Central Min), Puxian, and so on. The pronunciations of different dialects differ greatly and cannot be recognized with the same ASR model. Therefore, in this embodiment, ASR models are constructed separately for different dialects, in order to perform speech recognition on different dialects. In turn, based on the cooperation between the terminal device 102 and the server 101, the speech recognition function can be provided to users speaking different dialects, i.e., speech recognition can be performed on the speech signals of users speaking different dialects.
To improve speech recognition efficiency, the terminal device 102 supports a voice wake-up word function: when a user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device 102 to wake up the speech recognition function. The voice wake-up word is a voice signal of specified text content, for example "open", "Tmall Genie", or "hello". The terminal device 102 receives the voice wake-up word input by the user and identifies the dialect to which it belongs, and can thereby determine the dialect of the subsequent speech signals to be recognized (i.e., the dialect to which the voice wake-up word belongs), providing a basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect. The first dialect may be any dialect of any language.
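A minimal sketch of this identification step, assuming a nearest-template classifier over a toy one-number "feature"; a deployed system would extract real acoustic features (e.g. MFCCs) and run a trained dialect classifier. All names and numeric values here are illustrative, not from the disclosure.

```python
def feature(audio):
    # Toy scalar "feature": mean amplitude. A real system would extract
    # acoustic features such as MFCCs from the wake-word audio.
    return sum(audio) / len(audio)


# One reference value per dialect for the same wake-word text
# (illustrative numbers, not measured data).
TEMPLATES = {"Mandarin": 0.2, "Cantonese": 0.5, "Hokkien": 0.8}


def identify_first_dialect(wake_word_audio):
    f = feature(wake_word_audio)
    # The dialect whose template is closest to the observed feature wins.
    return min(TEMPLATES, key=lambda d: abs(TEMPLATES[d] - f))
```

Because the wake word's text is fixed, the classifier only has to separate pronunciations of one known phrase, which is a much easier task than open-vocabulary dialect identification.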
After identifying the first dialect to which the voice wake-up word belongs, the terminal device 102 can send a service request to the server 101, the service request instructing the server 101 to select, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect. The server 101 receives the service request sent by the terminal device 102 and then, according to its indication, selects the ASR model corresponding to the first dialect from among the ASR models corresponding to different dialects, so as to perform speech recognition on subsequent speech signals to be recognized based on that model. In this embodiment, the server 101 stores the ASR models corresponding to different dialects in advance. An ASR model is a model that can convert a speech signal into text. Optionally, one dialect may correspond to one ASR model, or several similar dialects may correspond to the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert speech signals of the first dialect into text content.
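The model store just described (one model per dialect, or one model shared by several similar dialects) might look like the following sketch; the groupings and model names are assumptions for illustration only.

```python
# Map each dialect to a model key; several similar sub-dialects may share
# one key and therefore one ASR model (illustrative grouping).
DIALECT_TO_MODEL_KEY = {
    "Minbei": "min",
    "Minnan": "min",
    "Mindong": "min",
    "Mandarin": "mandarin",
    "Cantonese": "cantonese",
}

# The actual model store, keyed by model key.
MODEL_STORE = {
    "min": "min-asr-model",
    "mandarin": "mandarin-asr-model",
    "cantonese": "cantonese-asr-model",
}


def select_model(first_dialect):
    """Select the ASR model corresponding to the first dialect."""
    return MODEL_STORE[DIALECT_TO_MODEL_KEY[first_dialect]]
```

The indirection through a model key is what lets several similar dialects share one model without duplicating it per dialect.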
After sending the service request to server 101, terminal device 102 continues by sending the speech signal to be recognized, which belongs to the first dialect. Server 101 receives the speech signal sent by terminal device 102 and performs speech recognition on it with the selected ASR model corresponding to the first dialect. This not only allows speech in the first dialect to be recognized, but also helps improve recognition accuracy by using a matching ASR model.
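The two-phase exchange just described, a service request that selects a model followed by the speech signal recognized with it, can be sketched as follows. The class and method names are illustrative placeholders, not interfaces defined by this application, and the "models" are toy callables standing in for real ASR models.

```python
# Minimal sketch of the terminal/server exchange: phase 1 selects a
# dialect-specific ASR model, phase 2 recognizes speech with it.
# All names here are illustrative, not from the patent.

class AsrServer:
    """Server 101: holds one ASR model per dialect, selects on request."""
    def __init__(self, models):
        self.models = models      # dialect name -> recognizer callable
        self.selected = None

    def handle_service_request(self, dialect):
        # Select the ASR model matching the first dialect named in the request.
        self.selected = self.models[dialect]
        return {"status": "model_selected", "dialect": dialect}

    def handle_speech(self, signal):
        # Recognize the subsequent speech with the previously selected model.
        return self.selected(signal)

# Toy "models": each just tags the signal with the dialect it would decode.
models = {"cantonese": lambda s: f"[yue] {s}",
          "tibetan":   lambda s: f"[bo] {s}"}

server = AsrServer(models)
ack = server.handle_service_request("cantonese")   # phase 1: service request
text = server.handle_speech("ng5 hou2")            # phase 2: speech to recognize
```

In a real deployment the two phases would be separate network messages; collapsing them into method calls keeps only the ordering that the text describes.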
Optionally, the speech signal to be recognized may be a speech signal that the user continues to input to terminal device 102 after inputting the voice wake-up word; in this case, terminal device 102 also receives the user's speech signal before sending it to server 101. Alternatively, the speech signal to be recognized may have been recorded in advance and stored locally on terminal device 102, in which case terminal device 102 can obtain it directly from local storage.
In some exemplary embodiments, server 101 may return the speech recognition result, or information related to it, to terminal device 102. For example, server 101 may return the recognized text content to terminal device 102; alternatively, it may return information such as a song or video that matches the recognition result. Terminal device 102 receives the recognition result or the related information and performs subsequent processing accordingly. For example, after receiving the recognized text content, terminal device 102 may display it to the user or perform a web search based on it. As another example, after receiving related information such as a song or video, terminal device 102 may play it, or forward it to other users to share the information.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is selected from the ASR models of the different dialects, and the selected model is used to recognize the subsequent speech signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which makes the system more convenient and faster to use and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
The embodiments of this application do not limit the manner in which terminal device 102 identifies the first dialect to which the voice wake-up word belongs; any manner capable of identifying that dialect is applicable to the embodiments of this application. Several manners used by terminal device 102 in some exemplary embodiments are enumerated below:
Mode 1: terminal device 102 performs dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and takes as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement.
In mode 1, reference wake-up words are recorded in different dialects in advance. The reference wake-up words recorded in the different dialects have the same text content as the voice wake-up word. Because users who speak different dialects produce sounds differently, the acoustic features of the reference wake-up words recorded in different dialects differ. On this basis, terminal device 102 pre-records the reference wake-up words in different dialects and, after receiving the voice wake-up word input by the user, performs dynamic matching of acoustic features between the voice wake-up word and each reference wake-up word, obtaining a matching degree against each one. The first set requirement may differ according to the application scenario. For example, the dialect corresponding to the reference wake-up word with the highest matching degree may be taken as the first dialect; alternatively, a matching degree threshold may be set, and the dialect corresponding to a reference wake-up word whose matching degree exceeds the threshold is taken as the first dialect; or a matching degree range may be set, and the dialect corresponding to a reference wake-up word whose matching degree falls within the range is taken as the first dialect.
In mode 1, the acoustic features may be embodied as time-domain and frequency-domain features of the speech signal. There are many methods for matching time-domain and frequency-domain features; optionally, dynamic matching of time series may be performed on the voice wake-up word based on the dynamic time warping (DTW) method.
Dynamic time warping is a method of measuring the similarity between two time series. Terminal device 102 generates a time series from the input voice wake-up word and compares it with the time series of each reference wake-up word recorded in a different dialect. Between the two time series being compared, at least one pair of similar points is determined, and the sum of the distances between the similar points, i.e. the warping path distance, measures the similarity of the two series. Optionally, the dialect corresponding to the reference wake-up word with the smallest warping path distance from the voice wake-up word may be taken as the first dialect; a distance threshold may also be set, and the dialect of a reference wake-up word whose warping path distance is below the threshold is taken as the first dialect; or a distance range may be set, and the dialect of a reference wake-up word whose warping path distance falls within the range is taken as the first dialect.
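A minimal pure-Python sketch of the DTW matching just described, assuming each wake-word recording has already been reduced to a one-dimensional feature sequence; real systems match multi-dimensional acoustic feature vectors, and the reference sequences here are toy data.

```python
def dtw_distance(a, b):
    """Warping-path distance between two feature sequences (lists of floats).
    Classic O(len(a)*len(b)) dynamic programming; a smaller distance means
    the two wake-word recordings are more similar."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# Match the input wake word against each dialect's reference wake word and
# take the dialect with the smallest warping-path distance (first variant).
references = {"mandarin": [1.0, 2.0, 3.0, 2.0], "cantonese": [1.0, 3.0, 5.0, 3.0]}
spoken = [1.0, 2.1, 3.1, 2.0]
best = min(references, key=lambda d: dtw_distance(spoken, references[d]))
```

Because DTW allows non-linear stretching along the time axis, the same wake word spoken faster or slower still matches its dialect's reference well, which is exactly what a fixed frame-by-frame comparison would miss.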
Mode 2: terminal device 102 identifies the acoustic features of the voice wake-up word, matches them against the acoustic features of different dialects, and takes as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement.
In mode 2, the acoustic features of the different dialects are obtained in advance, the acoustic features of the voice wake-up word are identified, and the first dialect to which the voice wake-up word belongs is then determined by matching between acoustic features.
Optionally, before the acoustic features of the voice wake-up word are identified, the voice wake-up word may be filtered and digitized. Filtering refers to retaining the components of the voice wake-up word whose frequency lies within 300–3400 Hz. Digitization refers to performing A/D conversion and anti-aliasing processing on the retained signal.
Optionally, the acoustic features of the voice wake-up word may be identified by computing its spectral feature parameters, such as shifted delta cepstrum parameters. As in mode 1, the second set requirement may differ according to the application scenario. For example, the dialect corresponding to the reference with the highest acoustic feature matching degree may be taken as the first dialect; a matching degree threshold may also be set, and the dialect of a reference whose acoustic feature matching degree exceeds the threshold is taken as the first dialect; or a matching degree range may be set, and the dialect of a reference whose acoustic feature matching degree falls within the range is taken as the first dialect.
The shifted delta cepstrum parameters are composed of several blocks of delta cepstra spanning multiple frames; because they account for the influence of delta cepstra between preceding and following frames, they incorporate more temporal information. The shifted delta cepstrum parameters of the voice wake-up word are compared with those of the reference wake-up words recorded in different dialects. Optionally, the dialect of the reference wake-up word whose shifted delta cepstrum parameters have the highest matching degree with those of the voice wake-up word is taken as the first dialect; a parameter difference threshold may also be set, and the dialect of a reference wake-up word whose parameter difference from the voice wake-up word is below the threshold is taken as the first dialect; or a parameter difference range may be set, and the dialect of a reference wake-up word whose parameter difference falls within the range is taken as the first dialect.
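A minimal sketch of how shifted-delta-cepstrum style features stack several blocks of frame deltas into one vector carrying temporal context. The d/p/k parameter names follow the common SDC convention, and the per-frame cepstra are toy data, not real speech features.

```python
# Shifted delta cepstra (SDC), minimally: the delta cepstra of k frame
# blocks, spaced p frames apart with delta spacing d, concatenated so one
# feature vector spans many frames of context.

def sdc(cepstra, d=1, p=2, k=3):
    """cepstra: list of per-frame cepstral vectors (lists of floats).
    For each valid frame t, returns the concatenation of the k block deltas
    c[t+i*p+d] - c[t+i*p-d] for i in 0..k-1."""
    out = []
    last = len(cepstra) - ((k - 1) * p + d) - 1
    for t in range(d, last + 1):
        vec = []
        for i in range(k):
            plus = cepstra[t + i * p + d]
            minus = cepstra[t + i * p - d]
            vec.extend(a - b for a, b in zip(plus, minus))
        out.append(vec)
    return out

frames = [[float(t), float(2 * t)] for t in range(10)]  # toy 2-dim cepstra
feats = sdc(frames)
```

With 2-dimensional cepstra and k = 3 blocks, each output vector is 6-dimensional; this widening of the per-frame vector is what lets a frame-level classifier see cross-frame dynamics.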
Mode 3: the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement is taken as the first dialect.
In mode 3, the text wake-up word is the text obtained by performing speech recognition on the voice wake-up word, and the reference text wake-up words corresponding to the different dialects are the texts obtained by performing speech recognition on the reference wake-up words of those dialects. Optionally, the same speech recognition model may be used to perform rough speech recognition on both the text wake-up word and the reference text wake-up words of the different dialects, to improve the efficiency of the overall speech recognition process. Alternatively, the ASR models corresponding to the different dialects may be used in advance to convert each dialect's reference wake-up word into its reference text wake-up word. Then, after the voice wake-up word is received, the ASR model of one dialect at a time is selected, the voice wake-up word is recognized with the selected model to obtain a text wake-up word, and the converted text is matched against that dialect's reference text wake-up word. If the matching degree between that dialect's reference text wake-up word and the text wake-up word meets the third set requirement, that dialect is taken as the first dialect. Otherwise, the next dialect's ASR model is used to convert the voice wake-up word into a text wake-up word and match it against that dialect's reference text wake-up word, and so on, until a reference text wake-up word whose matching degree with the text wake-up word meets the third set requirement is obtained; the dialect corresponding to that reference text wake-up word is then taken as the first dialect to which the voice wake-up word belongs.
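The sequential variant of mode 3 described above can be sketched as a loop over per-dialect ASR models. The stand-in recognizers, reference texts, and the `difflib`-based similarity measure are illustrative choices, not the models or metric the application actually uses.

```python
import difflib

def text_similarity(a, b):
    # Stand-in matching degree between two strings, in [0, 1].
    return difflib.SequenceMatcher(None, a, b).ratio()

def identify_dialect(wake_audio, asr_models, reference_texts, threshold=0.8):
    """asr_models: dialect -> callable(audio) -> recognized text.
    reference_texts: dialect -> text of that dialect's reference wake word.
    Tries each dialect's ASR model in turn and stops at the first dialect
    whose reference text wake-up word matches well enough."""
    for dialect, recognize in asr_models.items():
        text = recognize(wake_audio)
        if text_similarity(text, reference_texts[dialect]) >= threshold:
            return dialect          # third set requirement met
    return None                     # no dialect matched

# Toy models: the Mandarin model garbles Cantonese audio, the Cantonese
# model decodes it correctly, so the loop settles on "cantonese".
asr_models = {"mandarin":  lambda a: "zzzzz",
              "cantonese": lambda a: "nei hou"}
refs = {"mandarin": "ni hao", "cantonese": "nei hou"}
found = identify_dialect("<audio>", asr_models, refs)
```

The early return matters for efficiency: once a dialect's model reproduces its own reference text, the remaining dialects need not be tried.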
Optionally, as in modes 1 and 2, the dialect corresponding to the reference text wake-up word with the highest matching degree may be taken as the first dialect; a matching degree threshold may also be set, and the dialect of a reference text wake-up word whose matching degree exceeds the threshold is taken as the first dialect; or a matching degree range may be set, and the dialect of a reference text wake-up word whose matching degree falls within the range is taken as the first dialect.
It is worth noting that the first set requirement, the second set requirement, and the third set requirement may be identical or different.
In some exemplary embodiments, terminal device 102 is a device with a display screen, such as a mobile phone, computer, or wearable device. A voice input interface can then be shown on the display screen, and the text information and/or speech signal input by the user is obtained through it. Optionally, when the user needs speech recognition, an instruction to activate or switch on terminal device 102 may be sent by pressing the device's power key or touching its display screen. In response to the instruction, terminal device 102 shows the voice input interface to the user on the display screen. Optionally, a microphone icon or text information such as "input wake-up word" may be shown on the voice input interface to instruct the user to input the voice wake-up word. Terminal device 102 can then obtain the voice wake-up word input by the user through the voice input interface.
In some exemplary embodiments, terminal device 102 is a device with a voice playback function, such as a mobile phone, computer, or smart speaker. On this basis, after sending the service request to server 101 and before sending the speech signal to be recognized, terminal device 102 may output a voice input prompt, such as a speech signal like "please speak" or "please order a song", to prompt the user for voice input. After inputting the voice wake-up word, the user can input the speech signal to be recognized to terminal device 102 under this prompt tone. Terminal device 102 receives the speech signal input by the user and sends it to server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
In other exemplary embodiments, terminal device 102 is a device with a display screen, such as a mobile phone, computer, or wearable device. On this basis, after sending the service request to server 101 and before sending the speech signal to be recognized, terminal device 102 may show voice input prompt information in the form of text or an icon, such as text like "please speak" or a microphone icon, to prompt the user for voice input. After inputting the voice wake-up word, the user can input the speech signal to be recognized to terminal device 102 under this prompt. Terminal device 102 receives the speech signal input by the user and sends it to server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
In still other exemplary embodiments, terminal device 102 has an indicator light. On this basis, after sending the service request to server 101 and before sending the speech signal to be recognized, terminal device 102 may light the indicator to prompt the user for voice input. After inputting the voice wake-up word, the user can input the speech signal to be recognized to terminal device 102 under the indicator's prompt. Terminal device 102 receives the speech signal input by the user and sends it to server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
It is worth noting that terminal device 102 may have at least two or all three of the voice playback function, the indicator light, and the display screen. On this basis, terminal device 102 can output the voice input prompt simultaneously in two or three of the corresponding ways, audibly, as text or an icon, and by lighting the indicator, to reinforce the interaction with the user.
In some exemplary embodiments, before outputting the voice input prompt tone, outputting the voice input prompt information, or lighting the indicator, terminal device 102 may first determine that server 101 has selected the ASR model corresponding to the first dialect, so that after the user's speech signal to be recognized is sent to server 101 it can be recognized directly with the model selected there. On this basis, after selecting the ASR model corresponding to the first dialect from the ASR models of the different dialects, server 101 returns a notification message to terminal device 102 indicating that the ASR model corresponding to the first dialect has been selected. Terminal device 102 receives the notification message returned by server 101 and thereby learns that the model has been selected. Terminal device 102 then, after receiving the notification message, outputs the voice input prompt tone or prompt information, or lights the indicator, to prompt the user for voice input.
In the embodiments of this application, server 101 needs to build the ASR models corresponding to the different dialects before selecting the ASR model corresponding to the first dialect. The process by which server 101 builds the ASR models mainly includes: collecting corpora of the different dialects; performing feature extraction on each corpus to obtain the acoustic features of each dialect; and building each dialect's ASR model from its acoustic features. For the detailed process of building an ASR model for each dialect, reference may be made to the prior art; details are not repeated here.
Optionally, the corpora of the different dialects may be collected through the network, or obtained by recording the voices of a large number of users who speak the different dialects.
Optionally, before feature extraction is performed on the corpora of the different dialects, the collected corpora may be preprocessed. Preprocessing includes pre-emphasis, windowing, and endpoint detection. After the corpora are preprocessed, feature extraction can be performed on the speech. Speech features include time-domain and frequency-domain features. Time-domain features include short-time average energy, short-time average zero-crossing rate, formants, and pitch period; frequency-domain features include linear prediction coefficients, LPC cepstral coefficients, line spectrum pair parameters, the short-time spectrum, and Mel frequency cepstral coefficients.
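Two of the time-domain features just named lend themselves to a compact sketch: per-frame short-time average energy and short-time average zero-crossing rate. The frame length and the signal here are toy values.

```python
# Per-frame short-time average energy and zero-crossing rate, two of the
# time-domain features listed in the text.

def short_time_features(signal, frame_len=4):
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Average energy: mean of squared samples in the frame.
        energy = sum(s * s for s in frame) / frame_len
        # Zero-crossing rate: fraction of adjacent sample pairs changing sign.
        zcr = sum(1 for a, b in zip(frame, frame[1:])
                  if (a >= 0) != (b >= 0)) / (frame_len - 1)
        feats.append((energy, zcr))
    return feats

# First frame oscillates (high ZCR), second is a steady positive level.
feats = short_time_features([1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5])
```

Energy separates speech from silence, while the zero-crossing rate helps distinguish voiced from unvoiced segments, which is one reason these features appear in endpoint detection.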
In the following, the acoustic feature extraction process is illustrated by the extraction of Mel frequency cepstral coefficients. First, several band-pass filters are arranged within the spectral range of the speech according to the perceptual characteristics of the human ear, each with a triangular or sinusoidal filtering characteristic. The energy information in the feature vectors obtained by filtering the corpus through the band-pass filters is then accumulated to compute the signal energy of each filter, and the Mel frequency cepstral coefficients are computed by discrete cosine transform.
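The steps just named can be sketched end to end: triangular filters placed on a mel-spaced grid, per-filter log energies, then a discrete cosine transform. The power spectrum here is toy data; real use would start from framed, windowed FFT output, and the filter counts are illustrative.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular band-pass filters, equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    pts = [mel_to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(round(p / (sample_rate / 2) * (n_bins - 1))) for p in pts]
    banks = []
    for i in range(1, n_filters + 1):
        filt = [0.0] * n_bins
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for b in range(lo, mid + 1):
            if mid > lo:
                filt[b] = (b - lo) / (mid - lo)    # rising edge
        for b in range(mid, hi + 1):
            if hi > mid:
                filt[b] = (hi - b) / (hi - mid)    # falling edge
        banks.append(filt)
    return banks

def mfcc(power_spectrum, n_filters=8, n_coeffs=4, sample_rate=8000):
    banks = mel_filterbank(n_filters, len(power_spectrum), sample_rate)
    # Log of the accumulated energy inside each triangular filter.
    log_e = [math.log(max(sum(f * p for f, p in zip(filt, power_spectrum)),
                          1e-10)) for filt in banks]
    # DCT-II of the log filter energies gives the cepstral coefficients.
    return [sum(e * math.cos(math.pi * k * (n + 0.5) / n_filters)
                for n, e in enumerate(log_e)) for k in range(n_coeffs)]

spectrum = [1.0] * 65            # flat toy power spectrum, 65 FFT bins
coeffs = mfcc(spectrum)
```

Placing the filters on a mel grid concentrates resolution at low frequencies, mirroring the ear's perceptual characteristics that the text invokes.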
After the acoustic features of the different dialects are obtained, the parameters of each dialect's initial model are trained with that dialect's acoustic features as input and the text corresponding to its corpus as output, yielding the ASR model of each dialect. Optionally, ASR models include, but are not limited to, models built by vector quantization methods and neural network models.
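Of the model families mentioned, the vector-quantization approach admits a compact sketch: train a small codebook per dialect from its feature vectors, then score new speech by quantization distortion. This is a toy illustration with made-up 2-dimensional features and a crude k-means step, not the application's actual training procedure.

```python
# Toy vector-quantization sketch: one codebook per dialect, classified by
# which codebook quantizes a new feature sequence with less distortion.

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_codebook(vectors, k=2, iters=5):
    """Crude k-means: seed with the first k vectors, refine by reassignment."""
    centers = vectors[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist(v, centers[i]))].append(v)
        centers = [[sum(c) / len(g) for c in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def distortion(vectors, codebook):
    # Mean squared error of quantizing each vector to its nearest codeword.
    return sum(min(dist(v, c) for c in codebook) for v in vectors) / len(vectors)

book_a = train_codebook([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
book_b = train_codebook([[5.0, 5.0], [5.1, 5.0], [6.0, 6.0], [6.1, 6.0]])
sample = [[0.05, 0.0], [1.05, 1.0]]
closer = "a" if distortion(sample, book_a) < distortion(sample, book_b) else "b"
```

Neural-network acoustic models replace the codebook with learned frame classifiers, but the train-per-dialect, score-against-each structure is the same.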
The above embodiments are described in detail below with reference to an application scenario in which multiple users who speak different dialects order songs through the same terminal device.
The terminal device with the song-ordering function may be a smart speaker; optionally, the smart speaker has a display screen, and its preset voice wake-up word is "hello". When a user who speaks the Cantonese dialect wants to order a song, the Cantonese user first touches the display screen to input an instruction to activate the smart speaker. In response to the instruction, the smart speaker shows a voice input interface on the display screen, with the text "hello" displayed on it. The Cantonese user inputs the speech signal "hello" into the voice input interface. The smart speaker obtains the speech signal "hello" through the interface and identifies that it belongs to the Cantonese dialect; it then sends a service request to the server, requesting that the ASR model corresponding to Cantonese be selected from the ASR models of the different dialects. After receiving the service request, the server selects the ASR model corresponding to Cantonese and returns a notification message to the smart speaker indicating that the model has been selected. The smart speaker then outputs voice input prompt information, such as "please input voice", to prompt the user for voice input. Under this prompt, the Cantonese user inputs the speech signal of the song title "Five-Starred Red Flag". The smart speaker receives this speech signal and sends it to the server. The server performs speech recognition on the speech signal "Five-Starred Red Flag" with the ASR model corresponding to Cantonese to obtain the text "Five-Starred Red Flag", and delivers the matching song to the smart speaker, which plays it.
Similarly, after the Cantonese user finishes ordering, suppose a user who speaks the Tibetan dialect wants to order a song. The Tibetan user inputs the speech signal "hello" into the voice input interface shown by the smart speaker. The smart speaker identifies that this "hello" belongs to the Tibetan dialect; it then sends a service request to the server, requesting that the ASR model corresponding to Tibetan be selected from the ASR models of the different dialects. After receiving the service request, the server selects the ASR model corresponding to Tibetan and returns a notification message to the smart speaker indicating that the model has been selected. The smart speaker then outputs voice input prompt information, such as "please input voice", to prompt the user for voice input. Under this prompt, the Tibetan user inputs the speech signal of the song title "My Motherland". The smart speaker receives the speech signal "My Motherland" and sends it to the server. The server performs speech recognition on it with the ASR model corresponding to Tibetan to obtain the text "My Motherland", and delivers the matching song to the smart speaker, which plays it.
In this application scenario, with the speech recognition method provided by the embodiments of this application, when users who speak different dialects order songs through the same smart speaker, no manual switching of ASR models by the user is needed: the user need only input the voice wake-up word in the corresponding dialect, and the smart speaker automatically identifies the dialect to which it belongs and requests the server to use the corresponding ASR model to recognize the song title ordered by the user. This supports automated multi-dialect song ordering while improving its efficiency.
Fig. 2 is a flow diagram of a speech recognition method provided by another exemplary embodiment of this application. This embodiment may be implemented based on the speech recognition system shown in Fig. 1 and is described mainly from the perspective of the terminal device. As shown in Fig. 2, the method includes:
21. Receive a voice wake-up word.
22. Identify a first dialect to which the voice wake-up word belongs.
23. Send a service request to a server to request that the server select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects.
24. Send a speech signal to be recognized to the server, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.
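Steps 21–24 above can be sketched as a single terminal-side routine; the dialect classifier and the server transport are placeholders (here a trivial fake), not components specified by this application.

```python
# Terminal-side flow for steps 21-24: receive wake word, identify dialect,
# request model selection, then send the speech to be recognized.

def terminal_flow(wake_word_audio, speech_audio, identify_dialect, server):
    dialect = identify_dialect(wake_word_audio)        # steps 21-22
    server.handle_service_request(dialect)             # step 23
    return server.handle_speech(speech_audio)          # step 24

class FakeServer:
    """Stand-in for server 101, remembering the selected dialect."""
    def __init__(self):
        self.dialect = None

    def handle_service_request(self, dialect):
        self.dialect = dialect

    def handle_speech(self, audio):
        return f"[{self.dialect}] recognized"

result = terminal_flow("wake", "speech", lambda a: "cantonese", FakeServer())
```

The ordering is the essential part: the dialect must be identified and the model selection requested before the speech signal is sent, matching the sequence of steps 21–24.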
When the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device. The voice wake-up word is a speech signal with specified text content, such as "switch on", "Tmall Genie", or "hello". The terminal device receives the voice wake-up word input by the user and identifies the dialect to which it belongs, thereby determining the dialect of the subsequent speech signal to be recognized (i.e. the dialect of the voice wake-up word), which provides the basis for performing speech recognition with the ASR model corresponding to that dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted as the first dialect.
Then, after identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server, instructing it to select the ASR model corresponding to the first dialect from the ASR models corresponding to the different dialects. Next, the terminal device sends the speech signal to be recognized to the server. Upon receiving the service request, the server selects the ASR model corresponding to the first dialect from the ASR models of the different dialects and recognizes the received speech signal with the selected model.
In this embodiment, the terminal device identifies the first dialect to which the voice wake-up word belongs and sends a service request to the server, so that the server can select the ASR model corresponding to the first dialect from the ASR models of the different dialects and use it to recognize the subsequent speech signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which makes the system more convenient and faster to use and helps improve the efficiency of multi-dialect speech recognition. Further, because the voice wake-up word is relatively brief, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect and select the corresponding ASR model, further improving the efficiency of recognizing the speech signal to be recognized.
In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: performing dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement. Another manner includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement. Yet another manner includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against the reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In some exemplary embodiments, one manner of receiving the voice wake-up word includes: in response to an instruction to activate or switch on the terminal device, showing a voice input interface to the user, and obtaining the voice wake-up word input by the user through the voice input interface.
In some exemplary embodiments, before the speech signal to be recognized is sent to the server, the method further includes: outputting voice input prompt information to prompt the user for voice input, and receiving the speech signal to be recognized input by the user.
In some exemplary embodiments, before the voice input prompt information is output, the method further includes: receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
Fig. 3 is a flow diagram of another speech recognition method provided by another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 1 and is described mainly from the perspective of the server. As shown in Fig. 3, the method includes:
31. Receive a service request sent by a terminal device, the service request instructing selection of the ASR model corresponding to a first dialect.
32. From the ASR models corresponding to different dialects, select the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs.
33. Receive a speech signal to be recognized sent by the terminal device, and perform speech recognition on it using the ASR model corresponding to the first dialect.
In this embodiment, after identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server. According to the service request, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects, and can then perform speech recognition on subsequent speech signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, the implementation is more convenient and fast, and the efficiency of multi-dialect speech recognition is improved.
Further, because the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect, the server needs to build the ASR models corresponding to the different dialects. One process of building them mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features.
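The corpus-to-model pipeline just described can be sketched as follows. The frame-energy "features" and the per-dialect mean-value "model" are deliberately simplistic placeholders for real acoustic features (e.g. MFCCs) and real acoustic models, and are assumptions made only for illustration:

```python
def extract_features(samples, frame_len=4):
    """Toy feature extraction: mean absolute amplitude per fixed-length frame."""
    return [sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def build_dialect_models(corpora):
    """Build one 'model' per dialect from its corpus of utterances.

    corpora: {dialect_name: [utterance_sample_list, ...]}
    Returns {dialect_name: mean_feature_value} as a stand-in for an ASR model.
    """
    models = {}
    for dialect, utterances in corpora.items():
        feats = [f for u in utterances for f in extract_features(u)]
        models[dialect] = sum(feats) / len(feats)
    return models

corpora = {
    "mandarin": [[0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2]],
    "hunan":    [[0.8, 0.9, 0.8, 0.9, 0.8, 0.9, 0.8, 0.9]],
}
models = build_dialect_models(corpora)
print(sorted(models))  # ['hunan', 'mandarin']
```

The point is only the shape of the pipeline (corpus, then features, then one model per dialect); real training is far more involved, as the specification itself defers to the prior art.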
In some exemplary embodiments, after performing speech recognition on the speech signal to be recognized based on the ASR model corresponding to the first dialect, the server can send the speech recognition result, or information associated with it, to the terminal device, so that the terminal device performs subsequent processing based on the result or the associated information.
Fig. 4 is a structural diagram of another speech recognition system provided by another exemplary embodiment of the present application. As shown in Fig. 4, the speech recognition system 400 includes a server 401 and a terminal device 402 that are communicatively connected.
The architecture of the speech recognition system 400 provided in this embodiment is the same as that of the speech recognition system 100 shown in Fig. 1; the difference lies in the functions performed by the server 401 and the terminal device 402 in the speech recognition process. For the implementation forms of, and the communication connection between, the terminal device 402 and the server 401 in Fig. 4, refer to the description of the embodiment shown in Fig. 1, which is not repeated here.
Similar to the speech recognition system 100 shown in Fig. 1, in the speech recognition system 400 shown in Fig. 4 the terminal device 402 cooperates with the server 401 to provide a speech recognition function to users. Moreover, considering that in some cases the terminal device 402 may be used by multiple users who speak different dialects, the speech recognition system 400 builds an ASR model for each dialect; through the cooperation of the terminal device 402 and the server 401, it can then provide the speech recognition function to users who speak different dialects, that is, it can perform speech recognition on the speech signals of those users.
In the speech recognition system 400 shown in Fig. 4, the terminal device 402 also supports the voice wake-up word function, but it is mainly used to receive the voice wake-up word input by the user and report it to the server 401, so that the server 401 identifies the dialect to which the voice wake-up word belongs; this differs from the terminal device 102 in the embodiment shown in Fig. 1. Correspondingly, in the speech recognition system 400 shown in Fig. 4, the server 401 not only provides ASR models for different dialects and performs speech recognition on speech signals in a given dialect with the selected ASR model, but also has the function of identifying the dialect to which the voice wake-up word belongs.
Based on the speech recognition system 400 shown in Fig. 4, when a user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device 402. The voice wake-up word is a speech signal with specified text content, such as "start", "Tmall Genie", or "hello". The terminal device 402 receives the voice wake-up word input by the user and sends it to the server 401. After receiving the voice wake-up word sent by the terminal device 402, the server 401 identifies the dialect to which it belongs. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect; it may be, for example, Mandarin, Shanxi dialect, or Hunan dialect. Then, from the ASR models corresponding to different dialects, the server 401 selects the ASR model corresponding to the first dialect, so that speech signals in the first dialect can subsequently be recognized based on that model. In this embodiment, the server 401 pre-stores the ASR models corresponding to the different dialects. Optionally, each dialect may have its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert speech signals in the first dialect into text content.
After sending the voice wake-up word to the server 401, the terminal device 402 continues to send the speech signal to be recognized to the server 401. The server 401 receives the speech signal to be recognized sent by the terminal device 402 and performs speech recognition on it using the ASR model corresponding to the first dialect. Optionally, the speech signal to be recognized may be a speech signal that the user continues to input to the terminal device 402 after inputting the voice wake-up word; based on this, before sending the speech signal to be recognized to the server 401, the terminal device 402 may also receive the speech signal to be recognized input by the user. Alternatively, the speech signal to be recognized may be a speech signal pre-recorded and stored locally on the terminal device 402.
In this embodiment, an ASR model is built for each dialect. In the speech recognition process, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, the implementation is more convenient and fast, and the efficiency of multi-dialect speech recognition is improved.
Further, because the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, one manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking, as the first dialect, the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets the first setting requirement.
In other exemplary embodiments, another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking, as the first dialect, the dialect whose acoustic features have a matching degree with the voice wake-up word that meets the second setting requirement.
In still other exemplary embodiments, yet another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking, as the first dialect, the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement.
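The second manner, matching the wake-up word's acoustic features against per-dialect acoustic features, can be sketched as a nearest-profile comparison. The pooled feature vectors and dialect profiles below are illustrative assumptions standing in for whatever acoustic representation an implementation actually uses:

```python
def match_dialect_by_features(wake_features, dialect_profiles):
    """Compare the wake-up word's pooled acoustic features with a stored
    feature profile per dialect; the closest profile (highest matching
    degree) is taken as the first dialect.

    wake_features: per-frame feature vectors of the wake-up word.
    dialect_profiles: {dialect: profile_vector}.
    """
    dims = len(next(iter(dialect_profiles.values())))
    # Pool the frames into a single mean vector
    pooled = [sum(frame[d] for frame in wake_features) / len(wake_features)
              for d in range(dims)]

    def dist(profile):
        return sum((p - q) ** 2 for p, q in zip(pooled, profile)) ** 0.5

    return min(dialect_profiles, key=lambda dia: dist(dialect_profiles[dia]))

profiles = {"mandarin": [0.2, 0.8], "shanxi": [0.9, 0.1]}
frames = [[0.85, 0.15], [0.95, 0.05]]           # pools to [0.9, 0.1]
print(match_dialect_by_features(frames, profiles))  # shanxi
```

In a full implementation the distance would additionally be checked against the "second setting requirement" before the dialect is accepted.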
The manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs is similar to that of the terminal device 102; for details, refer to the foregoing embodiments, which are not repeated here.
In some exemplary embodiments, the manner in which the terminal device 402 receives the voice wake-up word includes: in response to an instruction to activate or start the terminal device, presenting a voice input interface to the user; and obtaining, via the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before sending the speech signal to be recognized to the server 401, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input, and then receive the speech signal to be recognized input by the user.
In some exemplary embodiments, before outputting the voice input prompt information, the terminal device 402 may receive a notification message returned by the server 401, the notification message indicating that the ASR model corresponding to the first dialect has been selected. Based on this, the terminal device 402 can output the voice input prompt information to the user after determining that the server 401 has selected the ASR model corresponding to the first dialect; in this way, once the speech signal to be recognized input by the user is sent to the server 401, the server 401 can directly recognize it with the selected ASR model.
In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 401 may collect corpora of the different dialects, perform feature extraction on the corpora to obtain the acoustic features of the different dialects, and build the ASR model corresponding to each dialect according to its acoustic features. For the detailed process of building the ASR model for each dialect, refer to the prior art; it is not described here.
In some exemplary embodiments, the server 401 may return the speech recognition result, or information associated with it, to the terminal device 402. For example, the server 401 may return the recognized text content to the terminal device 402; alternatively, it may return information such as a song or video that matches the speech recognition result. The terminal device 402 receives the speech recognition result or the associated information returned by the server 401 and performs subsequent processing accordingly.
Fig. 5 is a flow diagram of another speech recognition method provided by another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 4 and is described mainly from the perspective of the terminal device. As shown in Fig. 5, the method includes:
51. Receive a voice wake-up word.
52. Send the voice wake-up word to a server, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects.
53. Send a speech signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
When a user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device. The voice wake-up word is a speech signal with specified text content, such as "start", "Tmall Genie", or "hello". The terminal device receives the voice wake-up word input by the user and sends it to the server, so that the server identifies the dialect to which the voice wake-up word belongs and can thereby determine the dialect of the subsequent speech signal to be recognized (i.e., the dialect to which the voice wake-up word belongs), providing the basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
Then, according to the first dialect to which the voice wake-up word belongs, the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. The terminal device then continues to send the speech signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
In this embodiment, an ASR model is built for each dialect. In the speech recognition process, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, the implementation is more convenient and fast, and the efficiency of multi-dialect speech recognition is improved.
In some exemplary embodiments, receiving the voice wake-up word includes: in response to an instruction to activate or start the terminal device, presenting a voice input interface to the user; and obtaining, via the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before the speech signal to be recognized is sent to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the speech signal to be recognized input by the user.
In some exemplary embodiments, before the voice input prompt information is output, the method further includes: receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
Fig. 6 is a flow diagram of another speech recognition method provided by another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 4 and is described mainly from the perspective of the server. As shown in Fig. 6, the method includes:
61. Receive a voice wake-up word sent by a terminal device.
62. Identify the first dialect to which the voice wake-up word belongs.
63. From the ASR models corresponding to different dialects, select the ASR model corresponding to the first dialect.
64. Receive a speech signal to be recognized sent by the terminal device, and perform speech recognition on it using the ASR model corresponding to the first dialect.
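Steps 61-64 amount to a lookup keyed by the identified dialect, followed by recognition with the selected model. The sketch below assumes a simple registry of per-dialect recognizers; the class name and the lambda "recognizers" are illustrative, not from the specification, and the dialect is passed in directly where a real server would identify it from the wake-up word's acoustics:

```python
class DialectASRServer:
    """Minimal sketch of steps 61-64: select a dialect's ASR model after
    the wake-up word, then recognize later speech with it."""

    def __init__(self, models):
        # models: {dialect: recognize_fn}, one pre-built ASR model per dialect
        self.models = models
        self.selected = None

    def on_wake_word(self, wake_word_dialect):
        # Steps 62-63: select the model for the identified dialect
        self.selected = self.models[wake_word_dialect]

    def on_speech(self, signal):
        # Step 64: recognize with the previously selected dialect model
        if self.selected is None:
            raise RuntimeError("no ASR model selected yet")
        return self.selected(signal)

server = DialectASRServer({
    "mandarin": lambda s: f"mandarin:{s}",
    "hunan":    lambda s: f"hunan:{s}",
})
server.on_wake_word("hunan")
print(server.on_speech("hello"))  # hunan:hello
```

Keeping the selected model as server-side state mirrors the specification's ordering: the wake-up word arrives first, and the speech signal to be recognized follows.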
The server receives the voice wake-up word sent by the terminal device and identifies the dialect to which it belongs, and can thereby determine the dialect of the subsequent speech signal to be recognized (i.e., the dialect to which the voice wake-up word belongs), providing the basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
Then, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects, and can then perform speech recognition on subsequent speech signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, the implementation is more convenient and fast, and the efficiency of multi-dialect speech recognition is improved.
Further, because the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking, as the first dialect, the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets the first setting requirement.
In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking, as the first dialect, the dialect whose acoustic features have a matching degree with the voice wake-up word that meets the second setting requirement.
In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking, as the first dialect, the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement.
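The text-based manner, converting the wake-up word to text and matching it against per-dialect reference texts, can be sketched with an edit-distance comparison. The reference spellings and the distance threshold standing in for the "third setting requirement" are invented for illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dialect_from_text(text_wake_word, reference_texts, max_distance=2):
    """Pick the dialect whose reference text wake-up word is closest to
    the transcription, if it is close enough; otherwise None."""
    best = min(reference_texts,
               key=lambda d: edit_distance(text_wake_word, reference_texts[d]))
    if edit_distance(text_wake_word, reference_texts[best]) <= max_distance:
        return best
    return None  # no reference meets the setting requirement

refs = {"mandarin": "ni hao", "minnan": "li ho"}
print(dialect_from_text("li hoo", refs))  # minnan
```

In practice the transcription in step one would come from a dialect-agnostic recognizer, and the references could be per-dialect spellings or pronunciations of the same wake-up phrase.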
In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features.
In some exemplary embodiments, the server may return the speech recognition result, or information associated with it, to the terminal device. For example, the server may return the recognized text content to the terminal device; alternatively, it may return information such as a song or video that matches the speech recognition result.
In the above embodiments, multi-dialect speech recognition is performed jointly by the terminal device and the server, but it is not limited to this. For example, if the processing and storage capabilities of the terminal device or the server are powerful enough, the multi-dialect speech recognition function can be integrated into and implemented on the terminal device or the server alone. Based on this, another exemplary embodiment of the present application provides a speech recognition method implemented independently by a server or a terminal device. For brevity, the server and the terminal device are collectively referred to as an electronic device in the following embodiments. As shown in Fig. 7, the speech recognition method implemented independently by the server or the terminal device includes the following steps:
71. Receive a voice wake-up word.
72. Identify the first dialect to which the voice wake-up word belongs.
73. From the ASR models corresponding to different dialects, select the ASR model corresponding to the first dialect.
74. Perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect.
When a user wants to perform speech recognition, the user can input a voice wake-up word to the electronic device. The voice wake-up word is a speech signal with specified text content, such as "start", "Tmall Genie", or "hello". The electronic device receives the voice wake-up word input by the user and identifies the first dialect to which it belongs, the first dialect being the dialect of the voice wake-up word, for example Mandarin, Shanxi dialect, or Hunan dialect.
Then, from the ASR models corresponding to different dialects, the electronic device selects the ASR model corresponding to the first dialect, so that the subsequent speech signal to be recognized can be recognized based on that model. In this embodiment, the electronic device pre-stores the ASR models corresponding to the different dialects. Optionally, each dialect may have its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert speech signals in the first dialect into text content.
After selecting the ASR model corresponding to the first dialect, the electronic device can perform speech recognition on the speech signal to be recognized using that model. Optionally, the speech signal to be recognized may be a speech signal that the user continues to input to the electronic device after inputting the voice wake-up word; based on this, before performing speech recognition on the speech signal to be recognized with the ASR model corresponding to the first dialect, the electronic device may also receive the speech signal to be recognized input by the user. Alternatively, the speech signal to be recognized may be a speech signal pre-recorded and stored locally on the electronic device; based on this, the electronic device can obtain the speech signal to be recognized directly from local storage.
In this embodiment, an ASR model is built for each dialect. In the speech recognition process, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, the implementation is more convenient and fast, and the efficiency of multi-dialect speech recognition is improved.
Further, because the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking, as the first dialect, the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets the first setting requirement.
In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking, as the first dialect, the dialect whose acoustic features have a matching degree with the voice wake-up word that meets the second setting requirement.
In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking, as the first dialect, the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement.
In some exemplary embodiments, receiving the voice wake-up word includes: in response to an instruction to activate or start the terminal device, presenting a voice input interface to the user; and obtaining, via the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before speech recognition is performed on the speech signal to be recognized using the ASR model corresponding to the first dialect, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the speech signal to be recognized input by the user.
In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features.
In some exemplary embodiments, after performing speech recognition on the speech signal to be recognized based on the ASR model corresponding to the first dialect, the electronic device can perform subsequent processing based on the speech recognition result or information associated with it.
It is worth noting that, in the above embodiments of the present application or in the following embodiments, the voice wake-up word may be preset, or the user may be allowed to customize the wake-up word. Customizing or presetting the wake-up word here mainly refers to its content and/or tone. The function of customizing the voice wake-up word may be implemented by the terminal device or by the server. Optionally, it may be provided by the device that identifies the dialect to which the voice wake-up word belongs.
Taking the terminal device providing the wake-up word customization function as an example, the terminal device can provide the user with an entry for customizing the wake-up word. The entry may be implemented as a physical button; based on this, the user can click the physical button to trigger the wake-up word customization operation. Alternatively, the entry may be a wake-up word customization sub-item in the settings options of the terminal device; based on this, the user can open the settings options of the terminal device and then click, hover over, or long-press the wake-up word customization sub-item to trigger the wake-up word customization operation. Regardless of how the user triggers the operation, the terminal device may, in response to the wake-up word customization operation, receive the custom speech signal input by the user and save the received custom speech signal as the voice wake-up word. Optionally, the terminal device can present an audio recording page to the user to record the custom speech signal the user utters. For example, after the user triggers the wake-up word customization operation, the terminal device presents the audio recording page; the user can then input the speech signal "hello", and after receiving the speech signal "hello" the terminal device can set it as the voice wake-up word. Optionally, the terminal device can maintain a wake-up dictionary and save the user's custom voice wake-up words into it.
Optionally, the voice wake-up word should not be too long, so that the difficulty of identifying the dialect to which it belongs stays low, but it should not be too short either: a wake-up word that is too short is not distinctive enough and easily causes false wake-up. For example, the voice wake-up word may contain between 3 and 5 characters, but this is not limiting. Here one character may be one Chinese character or one English letter.
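The wake-up dictionary and the suggested 3-to-5-character length can be sketched together as a small store with validation. The class name, the byte-string placeholder for the recorded signal, and the space-ignoring character count are illustrative assumptions:

```python
class WakeDictionary:
    """Sketch of the wake-up dictionary: stores custom wake-up words and
    enforces the suggested 3-5 character length (one character being one
    Chinese character or one English letter)."""

    MIN_CHARS, MAX_CHARS = 3, 5

    def __init__(self):
        self.words = {}  # wake-word text -> recorded speech signal

    def add(self, text, signal):
        # Count characters/letters, ignoring spaces
        n = sum(1 for ch in text if not ch.isspace())
        if not self.MIN_CHARS <= n <= self.MAX_CHARS:
            raise ValueError(f"wake-up word should have {self.MIN_CHARS}-"
                             f"{self.MAX_CHARS} characters, got {n}")
        self.words[text] = signal

d = WakeDictionary()
d.add("nihao", b"...")       # 5 letters: accepted
try:
    d.add("hi", b"...")      # 2 letters: too short, prone to false wake-up
except ValueError as e:
    print("rejected:", e)
print(sorted(d.words))       # ['nihao']
```

A real implementation would store the recorded audio (or its features) rather than raw bytes, and might also reject common words, as suggested below.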
Optionally, when customizing the wake-up word, easily distinguishable words can be chosen, and common words should be avoided, to reduce the probability that the application is falsely woken up.
In other embodiments of the present application, the voice wake-up word is mainly used to wake up or activate the speech recognition function of the application, and the dialect of the voice wake-up word is not limited; that is, the user can utter the voice wake-up word in any dialect or in Mandarin. After uttering the voice wake-up word, the user can utter a speech signal with dialect-indicating significance, for example a speech signal whose content is "Tianjin dialect", "Henan dialect", or "enable Minnan dialect". The dialect in which speech recognition is to be performed can then be parsed from the dialect-indicating speech signal uttered by the user, the ASR model corresponding to the parsed dialect can be selected from the ASR models corresponding to different dialects, and speech recognition can be performed on the subsequent speech signal to be recognized based on the selected ASR model. For ease of distinction and description, the speech signal with dialect-indicating significance is referred to as the first speech signal, and the dialect parsed from the first speech signal is referred to as the first dialect.
Any voice signal with dialect-indicating significance may serve as the first voice signal in the embodiments of the application. For example, the first voice signal may be a voice signal uttered by the user in the first dialect, so that the first dialect can be identified from the acoustic features of the first voice signal. Alternatively, the first voice signal may be a voice signal containing the name of the first dialect; for instance, in the voice signal "please enable the Minnan dialect model", "Minnan dialect" is the name of the first dialect. Based on this, the phoneme segment corresponding to the name of the first dialect can be extracted from the first voice signal, and the first dialect identified from it.
The above speech recognition method combining the voice wake-up word and the first voice signal may be implemented by a terminal device and a server cooperating with each other, or independently by a terminal device or a server. The different implementations are described below:
Mode A: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by a terminal device and a server cooperating with each other. In Mode A, the terminal device supports the voice wake-up function. When the user wants to perform speech recognition, the user may input the voice wake-up word to the terminal device to wake up the speech recognition function. The terminal device receives the voice wake-up word and wakes up its speech recognition function. The user then inputs the first voice signal, which has dialect-indicating significance, to the terminal device. After receiving the first voice signal input by the user, the terminal device parses from it the first dialect in which speech recognition is needed, i.e. the dialect of the subsequent voice signals to be recognized, providing the basis for performing speech recognition with the ASR model corresponding to that dialect.
After parsing the first dialect from the first voice signal, the terminal device sends a service request to the server. The service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. After receiving the service request sent by the terminal device, the server selects, according to the instruction in the service request, the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform speech recognition on subsequent voice signals to be recognized based on the ASR model corresponding to the first dialect.
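The server-side model selection in Mode A can be sketched as follows, assuming a simple mapping from dialect name to ASR model; the dialect names, model identifiers, and function name are placeholders, and real model objects would be loaded from storage rather than represented as strings.

```python
# One ASR model per dialect, keyed by dialect name (placeholder values).
ASR_MODELS = {
    "Minnan": "asr-model-minnan",
    "Henan": "asr-model-henan",
    "Tianjin": "asr-model-tianjin",
}


def handle_service_request(request):
    """Select the ASR model named by the first dialect in the service request."""
    dialect = request["dialect"]
    model = ASR_MODELS.get(dialect)
    if model is None:
        raise KeyError(f"no ASR model for dialect {dialect!r}")
    return model


print(handle_service_request({"dialect": "Henan"}))  # asr-model-henan
```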
After sending the service request to the server, the terminal device continues to send the voice signal to be recognized, which belongs to the first dialect, to the server. The server receives the voice signal to be recognized sent by the terminal device and performs speech recognition on it according to the selected ASR model corresponding to the first dialect. Performing speech recognition on a voice signal with a matching ASR model helps improve recognition accuracy.
Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; based on this, the terminal device may also receive the voice signal to be recognized input by the user before sending it to the server. Alternatively, the voice signal to be recognized may be a voice signal pre-recorded and stored locally on the terminal device.
In some exemplary embodiments, the voice wake-up word is mainly used to wake up the speech recognition function of the terminal device, while the first dialect in which subsequent speech recognition is needed can be provided by the first voice signal. Based on this, the language in which the user utters the voice wake-up word need not be restricted. For example, the user may utter the voice wake-up word in Mandarin, in the first dialect, or in another dialect different from the first dialect.
However, the same user is likely to speak to the terminal device in the same language throughout the use of the device; that is, the user may input the voice wake-up word and the first voice signal to the terminal device in the same dialect. For such application scenarios, after receiving the first voice signal input by the user, the terminal device may preferentially parse the first dialect from the first voice signal; if it fails to parse the first dialect from the first voice signal, it may identify the dialect of the voice wake-up word as the first dialect. The specific implementation of identifying the dialect of the voice wake-up word is the same as in the above embodiments and is not repeated here.
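The fallback described above (prefer the dialect parsed from the first voice signal, otherwise fall back to the wake-word dialect) can be sketched as follows; the text-based `parse_dialect` is a deliberately simplified stand-in for the real phoneme-based parsing, and all names are illustrative.

```python
def parse_dialect(signal_text, known_dialects):
    """Stand-in parser: look for a known dialect name in the signal text.
    The embodiment actually matches phoneme segments, not raw text."""
    for dialect in known_dialects:
        if dialect in signal_text:
            return dialect
    return None  # parsing failed


def resolve_first_dialect(first_signal, wake_word_dialect, known_dialects):
    """Prefer the dialect parsed from the first voice signal; on failure,
    use the dialect identified from the voice wake-up word."""
    return parse_dialect(first_signal, known_dialects) or wake_word_dialect


dialects = ["Minnan", "Henan", "Tianjin"]
print(resolve_first_dialect("please enable Minnan mode", "Tianjin", dialects))  # Minnan
print(resolve_first_dialect("unintelligible audio", "Tianjin", dialects))       # Tianjin
```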
Mode B: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by a terminal device and a server cooperating with each other. In Mode B, the terminal device is mainly used to receive the voice wake-up word and the first voice signal input by the user and report them to the server, so that the server parses the first dialect from the first voice signal; in this respect it differs from the terminal device in Mode A. Correspondingly, in addition to providing ASR models for different dialects and selecting the corresponding ASR model to perform speech recognition on voice signals in the corresponding dialect, the server also has the function of parsing the first dialect from the first voice signal.
In Mode B, when the user wants to perform speech recognition, the user may input the voice wake-up word to the terminal device. The terminal device receives the voice wake-up word input by the user and sends it to the server. The server wakes up its own speech recognition function based on the voice wake-up word. After inputting the voice wake-up word, the user may continue to send the first voice signal to the terminal device. The terminal device sends the received first voice signal to the server. The server parses the first dialect from the first voice signal and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to subsequently perform speech recognition on voice signals in the first dialect based on that model.
After sending the first voice signal to the server, the terminal device continues to send the voice signal to be recognized to the server. After selecting the ASR model corresponding to the first dialect, the server may use it to perform speech recognition on the voice signal to be recognized. Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; based on this, the terminal device may also receive the voice signal to be recognized input by the user before sending it to the server. Alternatively, the voice signal to be recognized may be a voice signal pre-recorded and stored locally on the terminal device.
In some exemplary embodiments, before the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect of the voice wake-up word as the first dialect.
In some exemplary embodiments, when the server parses from the first voice signal the first dialect in which speech recognition is needed, the method includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names stored in memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Mode C: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented independently by a terminal device or a server. In Mode C, when the user wants to perform speech recognition, the user may input the voice wake-up word to the terminal device or the server. The terminal device or server wakes up its speech recognition function according to the voice wake-up word input by the user. After inputting the voice wake-up word, the user may continue to input the first voice signal with dialect-indicating significance to the terminal device or server. The terminal device or server parses the first dialect from the first voice signal and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
After selecting the ASR model corresponding to the first dialect, the terminal device or server may use it to perform speech recognition on the voice signal to be recognized. Optionally, the voice signal to be recognized may be a voice signal that the user continues to input to the terminal device or server after inputting the first voice signal; based on this, the terminal device or server may also receive the voice signal to be recognized input by the user before performing speech recognition on it with the ASR model corresponding to the first dialect. Alternatively, the voice signal to be recognized may be a voice signal pre-recorded and stored locally on the terminal device or server; based on this, the terminal device or server may obtain the voice signal to be recognized directly from local storage.
In some exemplary embodiments, before the terminal device or server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect of the voice wake-up word as the first dialect.
In some exemplary embodiments, terminal device or server need to carry out parsing from the first voice signal
When the first dialect of speech recognition, comprising: the first voice signal is converted to the first aligned phoneme sequence based on acoustic model;It will storage
The corresponding phoneme segment of different dialect titles stored in device is matched in the first aligned phoneme sequence respectively;When in the first phoneme
When matching middle pitch plain piece section in sequence, using the corresponding dialect of phoneme segment in matching as the first dialect.
Optionally, in the above Mode A, Mode B, and Mode C, parsing from the first voice signal the first dialect in which speech recognition is needed includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Before the first voice signal is converted into the first phoneme sequence based on the acoustic model, the first voice signal needs to undergo pre-processing and feature extraction. The pre-processing includes pre-emphasis, windowed framing, and endpoint detection. Feature extraction extracts acoustic features such as temporal features or frequency-domain features from the pre-processed first voice signal.
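The pre-emphasis and framing steps named above can be sketched as follows; the 0.97 coefficient and the toy frame sizes are typical illustrative values rather than values from the embodiment, and windowing and endpoint detection are omitted for brevity.

```python
def pre_emphasis(samples, alpha=0.97):
    """Pre-emphasis filter y[n] = x[n] - alpha * x[n-1], boosting the
    high-frequency content of the raw signal before framing."""
    return [samples[0]] + [samples[i] - alpha * samples[i - 1]
                           for i in range(1, len(samples))]


def frame(samples, size=4, step=2):
    """Split the emphasized signal into overlapping frames; a window
    function (e.g. Hamming) would normally be applied to each frame."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, step)]


sig = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
emphasized = pre_emphasis(sig)
frames = frame(emphasized)
print(len(frames))  # 2
```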
The acoustic model can convert the acoustic features of the first voice signal into a phoneme sequence. Phonemes are the basic units that constitute the pronunciation of English words or Chinese characters. The phonemes constituting English word pronunciation may be the 39 phonemes defined by Carnegie Mellon University; the phonemes constituting Chinese character pronunciation may be the full set of initials and finals. The acoustic model includes, but is not limited to, neural-network-based deep learning models and hidden Markov models. The manner of converting acoustic features into a phoneme sequence belongs to the prior art and is not described again here.
After converting the first voice signal into the first phoneme sequence, the terminal device or server matches the phoneme segments corresponding to different dialect names against the first phoneme sequence. The phoneme segments of different dialect names may be stored in advance, such as the phoneme segment of the dialect name "Henan dialect", the phoneme segment of the dialect name "Minnan dialect", or the phoneme segment of the dialect name "British English". If the dialect name is an English word, its phoneme segment is composed of phonemes drawn from the 39 phonemes defined by Carnegie Mellon University. If the dialect name consists of Chinese characters, its phoneme segment is composed of the initials and finals of the dialect name. The first phoneme sequence is compared with the pre-stored phoneme segments of the different dialect names to determine whether the first phoneme sequence contains a phoneme segment identical or similar to the phoneme segment of some dialect name. Optionally, the similarity between each phoneme segment in the first phoneme sequence and the phoneme segments of the different dialect names may be calculated; among the phoneme segments of the different dialect names, the phoneme segment whose similarity to some phoneme segment in the first phoneme sequence meets a preset similarity requirement is taken as the matched phoneme segment. The dialect corresponding to the matched phoneme segment is then taken as the first dialect.
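The similarity-based matching just described can be sketched as follows, using letter-level segments and `difflib.SequenceMatcher` as a stand-in for the real phoneme similarity calculation; the stored segments and the 0.8 threshold are illustrative assumptions.

```python
import difflib

# Pre-stored phoneme segments for dialect names (letters stand in for phonemes).
DIALECT_SEGMENTS = {
    "Henan": ["h", "e", "n", "a", "n"],
    "Minnan": ["m", "i", "n", "n", "a", "n"],
}


def match_dialect(phoneme_seq, threshold=0.8):
    """Slide each dialect-name segment over the first phoneme sequence and
    return the dialect whose segment best matches some window, provided the
    similarity meets the preset requirement (threshold)."""
    best_dialect, best_score = None, 0.0
    for dialect, seg in DIALECT_SEGMENTS.items():
        n = len(seg)
        for i in range(len(phoneme_seq) - n + 1):
            window = phoneme_seq[i:i + n]
            score = difflib.SequenceMatcher(None, window, seg).ratio()
            if score > best_score:
                best_dialect, best_score = dialect, score
    return best_dialect if best_score >= threshold else None


seq = ["q", "i", "n", "g", "h", "e", "n", "a", "n"]  # "... Henan"
print(match_dialect(seq))  # Henan
```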
It is worth noting that some steps or content in the above Mode A, Mode B, and Mode C are the same as or similar to some steps or content in the embodiments shown in Fig. 1 to Fig. 7; for these, reference may be made to the description of the embodiments shown in Fig. 1 to Fig. 7, and details are not repeated here.
In addition, some of the processes described in the above embodiments and drawings contain multiple operations that appear in a particular order, but it should be clearly understood that these operations need not be executed in the order in which they appear herein and may be executed in parallel. Operation serial numbers such as 201 and 202 are merely used to distinguish different operations; the serial numbers themselves do not represent any execution order. Furthermore, these processes may include more or fewer operations, and these operations may be executed in order or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and so on; they do not represent a sequence, nor do they require that "first" and "second" be of different types.
Fig. 8 is a schematic diagram of the module structure of a speech recognition apparatus provided by another exemplary embodiment of the application. As shown in Fig. 8, the speech recognition apparatus 800 includes a receiving module 801, an identification module 802, a first sending module 803, and a second sending module 804.
The receiving module 801 is used to receive the voice wake-up word.
The identification module 802 is used to identify the first dialect of the voice wake-up word received by the receiving module 801.
The first sending module 803 is used to send a service request to the server, requesting the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
The second sending module 804 is used to send the voice signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
In an optional embodiment, when identifying the first dialect of the voice wake-up word, the identification module 802 is specifically used to: perform dynamic matching of acoustic features between the voice wake-up word and benchmark wake-up words recorded in different dialects, and take the dialect corresponding to the benchmark wake-up word whose matching degree with the voice wake-up word meets a first set requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against benchmark text wake-up words corresponding to different dialects, and take the dialect corresponding to the benchmark text wake-up word whose matching degree with the text wake-up word meets a third set requirement as the first dialect.
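The first option above, dynamic matching of acoustic features against benchmark wake-up words, is commonly realized with dynamic time warping (DTW); a minimal sketch with one-dimensional features follows, where the benchmark sequences are placeholders and real features would be vectors such as MFCCs.

```python
def dtw(a, b):
    """Classic dynamic time warping distance between two feature sequences,
    allowing the sequences to be non-linearly aligned in time."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Benchmark wake-up word features recorded in different dialects (placeholders).
BENCHMARKS = {"Mandarin": [1.0, 2.0, 3.0], "Minnan": [2.0, 4.0, 6.0]}


def closest_dialect(features):
    """Pick the dialect whose benchmark wake-up word best matches the input."""
    return min(BENCHMARKS, key=lambda d: dtw(features, BENCHMARKS[d]))


print(closest_dialect([1.1, 2.1, 2.9]))  # Mandarin
```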
In an optional embodiment, the receiving module 801 is specifically used to: when receiving the voice wake-up word, display a voice input interface to the user in response to an instruction to activate or power on the terminal device, and obtain the voice wake-up word input by the user via the voice input interface.
In an optional embodiment, before sending the voice signal to be recognized to the server, the second sending module 804 is further used to: output voice input prompt information to prompt the user to perform voice input, and receive the voice signal to be recognized input by the user.
In an optional embodiment, before outputting the voice input prompt information, the second sending module 804 is further used to: receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional embodiment, before receiving the voice wake-up word, the receiving module 801 is further used to: receive, in response to a wake-word customization operation, a custom voice signal input by the user, and save the custom voice signal as the voice wake-up word.
The internal functions and structure of the speech recognition apparatus 800 are described above. As shown in Fig. 9, in practice, the apparatus may be implemented as a terminal device, comprising: a memory 901, a processor 902, and a communication component 903.
The memory 901 is used to store a computer program and may be configured to store various other data to support operations on the terminal device. Examples of such data include instructions of any application or method operating on the terminal device, contact data, phone book data, messages, pictures, video, and so on.
The memory 901 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
The processor 902 is coupled with the memory 901 and executes the computer program in the memory 901, so as to: receive the voice wake-up word through the communication component 903; identify the first dialect of the voice wake-up word; send a service request to the server through the communication component 903, requesting the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the voice signal to be recognized to the server through the communication component 903, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
The communication component 903 is used to receive the voice wake-up word and to send the service request and the voice signal to be recognized to the server.
In an optional embodiment, when identifying the first dialect of the voice wake-up word, the processor 902 is specifically used to:
perform dynamic matching of acoustic features between the voice wake-up word and benchmark wake-up words recorded in different dialects, and take the dialect corresponding to the benchmark wake-up word whose matching degree with the voice wake-up word meets a first set requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second set requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against benchmark text wake-up words corresponding to different dialects, and take the dialect corresponding to the benchmark text wake-up word whose matching degree with the text wake-up word meets a third set requirement as the first dialect.
In an optional embodiment, as shown in Fig. 9, the terminal device further includes a display screen 904. Based on this, when receiving the voice wake-up word, the processor 902 is specifically used to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user through the display screen 904, and obtain the voice wake-up word input by the user via the voice input interface.
In an optional embodiment, the terminal device further includes an audio component 906. Based on this, before sending the voice signal to be recognized to the server, the processor 902 is further used to: output voice input prompt information through the audio component 906 to prompt the user to perform voice input, and receive through the audio component 906 the voice signal to be recognized input by the user. Correspondingly, the audio component 906 is also used to output the voice input prompt information and receive the voice signal to be recognized input by the user.
In an optional embodiment, before outputting the voice input prompt information, the processor 902 is further used to: receive through the communication component 903 a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional embodiment, before receiving the voice wake-up word, the processor 902 is further used to: receive through the communication component 903, in response to a wake-word customization operation, a custom voice signal input by the user, and save the custom voice signal as the voice wake-up word.
Further, as shown in Fig. 9, the terminal device also includes other components such as a power supply component 905.
Correspondingly, the embodiments of the application also provide a computer-readable storage medium storing a computer program which, when executed, can implement each step that can be executed by the terminal device in the above method embodiments.
Fig. 10 is a schematic diagram of the module structure of another speech recognition apparatus provided by another exemplary embodiment of the application. As shown in Fig. 10, the speech recognition apparatus 1000 includes a first receiving module 1001, a selecting module 1002, a second receiving module 1003, and an identification module 1004.
The first receiving module 1001 is used to receive a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect.
The selecting module 1002 is used to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the first dialect being the dialect of the voice wake-up word.
The second receiving module 1003 is used to receive the voice signal to be recognized sent by the terminal device.
The identification module 1004 is used to perform speech recognition on the voice signal to be recognized received by the second receiving module 1003, using the ASR model corresponding to the first dialect.
In an optional embodiment, the speech recognition apparatus 1000 further includes a building module, used to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of different dialects; perform feature extraction on the corpora of different dialects to obtain the acoustic features of different dialects; and build the ASR models corresponding to different dialects according to their acoustic features.
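The building module's pipeline (collect corpora, extract acoustic features, build one model per dialect) can be sketched as follows; `extract_features` is a trivial stand-in for real acoustic feature extraction, and all names are illustrative.

```python
def extract_features(corpus):
    """Stand-in for acoustic feature extraction over a dialect corpus;
    real features would be MFCCs or similar, not utterance lengths."""
    return [len(utterance) for utterance in corpus]


def build_models(corpora):
    """Build one (toy) ASR model per dialect from its corpus.

    corpora: mapping of dialect name -> list of utterances."""
    return {dialect: {"features": extract_features(utterances)}
            for dialect, utterances in corpora.items()}


models = build_models({"Henan": ["ni hao", "zuo sha"],
                       "Minnan": ["li ho"]})
print(sorted(models))  # ['Henan', 'Minnan']
```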
The internal functions and structure of the speech recognition apparatus 1000 are described above. As shown in Fig. 11, in practice, the speech recognition apparatus 1000 may be implemented as a server, comprising: a memory 1101, a processor 1102, and a communication component 1103.
The memory 1101 is used to store a computer program and may be configured to store various other data to support operations on the server. Examples of such data include instructions of any application or method operating on the server, contact data, phone book data, messages, pictures, video, and so on.
The memory 1101 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
The processor 1102 is coupled with the memory 1101 and executes the computer program in the memory 1101, so as to: receive through the communication component 1103 the service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the first dialect being the dialect of the voice wake-up word; and receive through the communication component 1103 the voice signal to be recognized sent by the terminal device and perform speech recognition on it using the ASR model corresponding to the first dialect.
The communication component 1103 is used to receive the service request and the voice signal to be recognized.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1102 is further used to: collect corpora of different dialects; perform feature extraction on the corpora of different dialects to obtain the acoustic features of different dialects; and build the ASR models corresponding to different dialects according to their acoustic features.
Further, as shown in Fig. 11, the server also includes an audio component 1106. Based on this, the processor 1102 is further used to: receive through the audio component 1106 the voice signal to be recognized sent by the terminal device.
Optionally, as shown in Fig. 11, the server also includes other components such as a display screen 1104 and a power supply component 1105.
Correspondingly, the embodiments of the application also provide a computer-readable storage medium storing a computer program which, when executed, can implement each step that can be executed by the server in the above method embodiments.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect of the voice wake-up word is identified in advance, the ASR model corresponding to that dialect is selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on subsequent voice signals to be recognized. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, since the voice wake-up word is relatively brief, identifying its dialect takes comparatively little time, enabling the speech recognition system to quickly identify the first dialect of the voice wake-up word and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
Fig. 12 is a schematic diagram of the module structure of another speech recognition apparatus provided by another exemplary embodiment of the application. As shown in Fig. 12, the speech recognition apparatus 1200 includes a receiving module 1201, a first sending module 1202, and a second sending module 1203.
The receiving module 1201 is used to receive the voice wake-up word.
The first sending module 1202 is used to send the voice wake-up word received by the receiving module 1201 to the server, so that the server, based on the voice wake-up word, selects from the ASR models corresponding to different dialects the ASR model corresponding to the first dialect of the voice wake-up word.
The second sending module 1203 is used to send the voice signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
In an optional embodiment, the receiving module 1201 is specifically used to: when receiving the voice wake-up word, display a voice input interface to the user in response to an instruction to activate or power on the terminal device, and obtain the voice wake-up word input by the user via the voice input interface.
In an optional embodiment, before sending the voice signal to be recognized to the server, the second sending module 1203 is further used to: output voice input prompt information to prompt the user to perform voice input, and receive the voice signal to be recognized input by the user.
In an optional embodiment, before outputting the voice input prompt information, the second sending module 1203 is further used to: receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional embodiment, before receiving the voice wake-up word, the receiving module 1201 is further used to: receive, in response to a wake-word customization operation, a custom voice signal input by the user. The first sending module 1202 is further used to upload the custom voice signal to the server.
The internal functions and structure of the speech recognition apparatus 1200 have been described above. As shown in figure 13, in practice the speech recognition apparatus 1200 may be implemented as a terminal device, including: a memory 1301, a processor 1302, and a communication component 1303.
The memory 1301 is configured to store a computer program, and may also store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operated on the terminal device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 1301 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The processor 1302 is coupled with the memory 1301 and is configured to execute the computer program in the memory 1301 so as to: receive a voice wake-up word through the communication component 1303; send the voice wake-up word to the server through the communication component 1303, so that the server, based on the voice wake-up word, selects from ASR models corresponding to different dialects the ASR model corresponding to the first dialect to which the voice wake-up word belongs; and send a speech signal to be recognized to the server through the communication component 1303, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component 1303 is configured to receive the voice wake-up word, and to send the voice wake-up word and the speech signal to be recognized to the server.
In an optional embodiment, as shown in figure 13, the terminal device further includes a display screen 1304. On this basis, when receiving the voice wake-up word, the processor 1302 is specifically configured to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user through the display screen 1304; and obtain the voice wake-up word input by the user through the voice input interface.
In an optional embodiment, as shown in figure 13, the terminal device further includes an audio component 1306. On this basis, the processor 1302 is configured to receive the voice wake-up word through the audio component 1306. Correspondingly, before sending the speech signal to be recognized to the server, the processor 1302 is further configured to: output a voice input prompt through the audio component 1306 to prompt the user to perform voice input; and receive the speech signal to be recognized input by the user.
In an optional embodiment, before outputting the voice input prompt, the processor 1302 is further configured to receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional embodiment, before receiving the voice wake-up word, the processor 1302 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user through the communication component 1303, and upload the custom voice signal to the server.
Further, as shown in figure 13, the terminal device also includes other components such as a power supply module 1305.
Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed, implements the steps that can be executed by the terminal device in the above method embodiments.
Figure 14 is a schematic block diagram of another speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in figure 14, the speech recognition apparatus 1400 includes a first receiving module 1401, a first identification module 1402, a selecting module 1403, a second receiving module 1404, and a second identification module 1405.
The first receiving module 1401 is configured to receive the voice wake-up word sent by the terminal device.
The first identification module 1402 is configured to identify the first dialect to which the voice wake-up word belongs.
The selecting module 1403 is configured to select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects.
The second receiving module 1404 is configured to receive the speech signal to be recognized sent by the terminal device.
The second identification module 1405 is configured to perform speech recognition, using the ASR model corresponding to the first dialect, on the speech signal to be recognized received by the second receiving module 1404.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, the first identification module 1402 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose acoustic features have a matching degree with those of the voice wake-up word that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
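As a toy illustration of the first strategy above, the sketch below performs dynamic matching — a plain dynamic-time-warping (DTW) distance over one-dimensional stand-in "acoustic features" — between an incoming wake-up word and reference wake-up words recorded in different dialects. The feature values, dialect names, and the distance threshold standing in for the "first set requirement" are all invented for the example; a real system would operate on MFCC frame sequences.

```python
# Hypothetical sketch: pick the dialect whose reference wake-up word is
# closest under DTW, subject to a threshold (the "set requirement").

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def identify_dialect(wake_features, references, max_distance=5.0):
    """Return the best-matching dialect, or None if no reference wake-up
    word matches closely enough."""
    best_dialect, best_dist = None, float("inf")
    for dialect, ref in references.items():
        dist = dtw_distance(wake_features, ref)
        if dist < best_dist:
            best_dialect, best_dist = dialect, dist
    return best_dialect if best_dist <= max_distance else None

# Invented reference wake-up words, one per dialect:
references = {
    "mandarin":  [1.0, 2.0, 3.0, 2.0],
    "cantonese": [3.0, 3.5, 1.0, 0.5],
}
print(identify_dialect([1.1, 2.1, 2.9, 2.2], references))  # mandarin
```

DTW tolerates speaking-rate differences between the user's wake-up word and the recorded references, which is why the patent text calls the matching "dynamic".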
In an optional embodiment, the speech recognition apparatus 1400 further includes a building module configured to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects; and build, according to the acoustic features of the different dialects, the ASR model corresponding to each dialect.
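The three steps of the building module (collect corpora, extract features, build one model per dialect) can be sketched as a data-flow skeleton. The "feature extraction" and the per-dialect "model" below are deliberately trivial stand-ins — peak normalization and an element-wise mean template — chosen only to show the pipeline shape; all corpus data is invented, and a real ASR model would combine acoustic and language models.

```python
# Hypothetical per-dialect model-building pipeline.

def extract_features(utterance):
    # Stand-in for real feature extraction (e.g. MFCCs): peak-normalize.
    peak = max(abs(s) for s in utterance) or 1.0
    return [s / peak for s in utterance]

def build_models(corpora):
    """corpora: dialect name -> list of utterances (lists of samples).
    Returns one "model" (a mean-feature template) per dialect."""
    models = {}
    for dialect, utterances in corpora.items():
        feats = [extract_features(u) for u in utterances]
        # "Model" = element-wise mean over the dialect's feature sequences.
        length = min(len(f) for f in feats)
        models[dialect] = [
            sum(f[i] for f in feats) / len(feats) for i in range(length)
        ]
    return models

corpora = {
    "mandarin":  [[1, 2, 3], [3, 2, 1]],
    "cantonese": [[2, 2, 2], [4, 0, 4]],
}
models = build_models(corpora)
print(sorted(models))  # ['cantonese', 'mandarin']
```

The key structural point is that the corpus collection and feature extraction run offline, before any selection happens at recognition time, so that a ready-made model per dialect is available for the selecting module.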
The internal functions and structure of the speech recognition apparatus 1400 have been described above. As shown in figure 15, in practice the speech recognition apparatus 1400 may be implemented as a server, including: a memory 1501, a processor 1502, and a communication component 1503.
The memory 1501 is configured to store a computer program, and may also store various other data to support operations on the server. Examples of such data include instructions for any application or method operated on the server, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 1501 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The processor 1502 is coupled with the memory 1501 and is configured to execute the computer program in the memory 1501 so as to: receive, through the communication component 1503, the voice wake-up word sent by the terminal device; identify the first dialect to which the voice wake-up word belongs; select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects; receive, through the communication component 1503, the speech signal to be recognized sent by the terminal device; and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component 1503 is configured to receive the voice wake-up word and the speech signal to be recognized.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, the processor 1502 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose acoustic features have a matching degree with those of the voice wake-up word that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1502 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects; and build, according to the acoustic features of the different dialects, the ASR model corresponding to each dialect.
Further, as shown in figure 15, the server also includes an audio component 1506. On this basis, the processor 1502 is configured to: receive, through the audio component 1506, the voice wake-up word sent by the terminal device, and receive, through the audio component 1506, the speech signal to be recognized sent by the terminal device.
Further, as shown in figure 15, the server also includes other components such as a display screen 1504 and a power supply module 1505.
Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed, implements the steps that can be executed by the server in the above method embodiments.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and speech recognition is performed on the subsequent speech signal to be recognized using the selected ASR model. This automates multi-dialect speech recognition; moreover, because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, since the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes relatively little time, enabling the speech recognition system to quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
Figure 16 is a schematic block diagram of another speech recognition apparatus provided by another exemplary embodiment of the present application. As shown in figure 16, the speech recognition apparatus 1600 includes a receiving module 1601, a first identification module 1602, a selecting module 1603, and a second identification module 1604.
The receiving module 1601 is configured to receive a voice wake-up word.
The first identification module 1602 is configured to identify the first dialect to which the voice wake-up word belongs.
The selecting module 1603 is configured to select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects.
The second identification module 1604 is configured to perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, the first identification module 1602 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose acoustic features have a matching degree with those of the voice wake-up word that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In an optional embodiment, when receiving the voice wake-up word sent by the terminal device, the receiving module 1601 is specifically configured to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user; and obtain the voice wake-up word input by the user through the voice input interface.
In an optional embodiment, before performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect, the second identification module 1604 is further configured to: output a voice input prompt to prompt the user to perform voice input; and receive the speech signal to be recognized input by the user.
In an optional embodiment, the speech recognition apparatus 1600 further includes a building module configured to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects; and build, according to the acoustic features of the different dialects, the ASR model corresponding to each dialect.
In an optional embodiment, before receiving the voice wake-up word, the receiving module 1601 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user; and save the custom voice signal as the voice wake-up word.
The internal functions and structure of the speech recognition apparatus 1600 have been described above. As shown in figure 17, in practice the speech recognition apparatus 1600 may be implemented as an electronic device, including: a memory 1701, a processor 1702, and a communication component 1703. The electronic device may be a terminal device or a server.
The memory 1701 is configured to store a computer program, and may also store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operated on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 1701 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The processor 1702 is coupled with the memory 1701 and is configured to execute the computer program in the memory 1701 so as to: receive a voice wake-up word through the communication component 1703; identify the first dialect to which the voice wake-up word belongs; select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component 1703 is configured to receive the voice wake-up word.
In an optional embodiment, when identifying the first dialect to which the voice wake-up word belongs, the processor 1702 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose acoustic features have a matching degree with those of the voice wake-up word that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third set requirement.
In an optional embodiment, as shown in figure 17, the electronic device further includes a display screen 1704. On this basis, when receiving the voice wake-up word sent by the terminal device, the processor 1702 is specifically configured to: in response to an instruction to activate or power on the terminal device, display a voice input interface to the user through the display screen 1704; and obtain the voice wake-up word input by the user through the voice input interface.
In an optional embodiment, as shown in figure 17, the electronic device further includes an audio component 1706. On this basis, before performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect, the processor 1702 is further configured to: output a voice input prompt through the audio component 1706 to prompt the user to perform voice input; and receive the speech signal to be recognized input by the user. Correspondingly, the processor 1702 is also configured to receive the voice wake-up word through the audio component 1706.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1702 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects; and build, according to the acoustic features of the different dialects, the ASR model corresponding to each dialect.
In an optional embodiment, before receiving the voice wake-up word, the processor 1702 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user through the communication component 1703; and save the custom voice signal as the voice wake-up word. Further, as shown in figure 17, the electronic device also includes other components such as a power supply module 1705.
Correspondingly, an embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed, implements the steps that can be executed by the electronic device in the above method embodiments.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and speech recognition is performed on the subsequent speech signal to be recognized using the selected ASR model. This automates multi-dialect speech recognition; moreover, because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is needed, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, since the voice wake-up word is relatively brief, identifying the dialect to which it belongs takes relatively little time, enabling the speech recognition system to quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
An embodiment of the present application also provides a terminal device, including: a memory, a processor, and a communication component.
The memory is configured to store a computer program, and may also store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operated on the terminal device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The processor is coupled with the memory and the communication component and is configured to execute the computer program in the memory so as to: receive a voice wake-up word through the communication component, to wake up the speech recognition function; receive, through the communication component, a first voice signal with dialect-indicative meaning input by the user; parse, from the first voice signal, the first dialect in which speech recognition needs to be performed; select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects; send a service request to the server through the communication component, to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
In an optional embodiment, before sending the service request to the server, the processor is further configured to: if the first dialect cannot be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
In an optional embodiment, the memory is also configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing from the first voice signal the first dialect in which speech recognition needs to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
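The parsing step above can be sketched as follows, assuming the acoustic-model conversion to a phoneme sequence has already happened. The phoneme inventories and dialect names below are invented for illustration, and matching is a simple contiguous-subsequence search over the phoneme sequence.

```python
# Hypothetical sketch: find a stored dialect-name phoneme segment inside
# the first phoneme sequence. All phoneme spellings are made up.

DIALECT_PHONEME_SEGMENTS = {
    "cantonese":  ["g", "w", "o", "ng"],
    "sichuanese": ["s", "i", "ch", "uan"],
}

def contains_segment(sequence, segment):
    """True if `segment` appears as a contiguous run in `sequence`."""
    n, m = len(sequence), len(segment)
    return any(sequence[i:i + m] == segment for i in range(n - m + 1))

def parse_first_dialect(phoneme_sequence):
    for dialect, segment in DIALECT_PHONEME_SEGMENTS.items():
        if contains_segment(phoneme_sequence, segment):
            return dialect
    # None: the caller may fall back to the wake-up word's dialect.
    return None

# A dialect-indicative utterance rendered as a toy phoneme sequence:
seq = ["q", "ing", "s", "i", "ch", "uan", "h", "ua"]
print(parse_first_dialect(seq))  # sichuanese
```

Returning None rather than guessing is what makes the fallback embodiment (using the wake-up word's dialect when parsing fails) fit naturally on top of this step.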
An embodiment of the present application also provides a server, including: a memory, a processor, and a communication component.
The memory is configured to store a computer program, and may also store various other data to support operations on the server. Examples of such data include instructions for any application or method operated on the server, contact data, phonebook data, messages, pictures, videos, and the like.
The memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The processor is coupled with the memory and the communication component and is configured to execute the computer program in the memory so as to: receive, through the communication component, the voice wake-up word sent by the terminal device, to wake up the speech recognition function; receive, through the communication component, the first voice signal with dialect-indicative meaning sent by the terminal device; parse, from the first voice signal, the first dialect in which speech recognition needs to be performed; select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects; receive, through the communication component, the speech signal to be recognized sent by the terminal device; and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word, the first voice signal, and the speech signal to be recognized.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect cannot be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
In an optional embodiment, the memory is also configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing from the first voice signal the first dialect in which speech recognition needs to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
An embodiment of the present application also provides an electronic device, which may be a terminal device or a server. The electronic device includes: a memory, a processor, and a communication component.
The memory is configured to store a computer program, and may also store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operated on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The processor is coupled with the memory and the communication component and is configured to execute the computer program in the memory so as to: receive a voice wake-up word through the communication component, to wake up the speech recognition function; receive, through the communication component, a first voice signal with dialect-indicative meaning input by the user; parse, from the first voice signal, the first dialect in which speech recognition needs to be performed; select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
The communication component is configured to receive the voice wake-up word and the first voice signal.
In an optional embodiment, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect cannot be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
In an optional embodiment, the memory is also configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing from the first voice signal the first dialect in which speech recognition needs to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
The communication components in figures 9, 11, 13, 15, and 17 above are configured to facilitate wired or wireless communication between the device in which the communication component resides and other devices. The device in which the communication component resides can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component receives, via a broadcast channel, a broadcast signal or broadcast-related information from an external broadcast management system. In an exemplary embodiment, the communication component also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display screens in figures 9, 11, 13, 15, and 17 above include a liquid crystal display (LCD) and a touch panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation.
The power supply modules in figures 9, 11, 13, 15, and 17 above provide power for the various components of the device in which the power supply module resides. The power supply module may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which it resides.
The audio components in Fig. 9, Fig. 11, Fig. 13, Fig. 15 and Fig. 17 above are configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC); when the device in which the audio component resides is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode, the microphone is configured to receive external audio signals. The received audio signal may be further stored in the memory or transmitted via the communication component. In some embodiments, the audio component further includes a speaker for outputting audio signals.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.
The above descriptions are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.
Claims (24)
1. A speech recognition method, applicable to a terminal device, characterized in that the method comprises:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
sending a service request to a server, to request the server to select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
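The terminal-side sequence of claim 1 (identify the dialect, ask the server to select the matching ASR model, then send the speech to be recognized) can be sketched as below. This is an illustrative sketch only: the JSON message format, the field names `type` and `dialect`, and the `identify`/`send` callables are assumptions, as the claim does not fix any wire format.

```python
import json

def make_service_request(dialect):
    """Build the service request: per claim 1 it only needs to tell the
    server which dialect's ASR model to select. Field names are
    illustrative, not part of the claim."""
    return json.dumps({"type": "select_asr_model", "dialect": dialect})

def terminal_flow(wake_word, identify, send, audio):
    """Terminal-side sequence: identify the dialect from the wake-up word,
    send the service request, then send the speech to be recognized."""
    dialect = identify(wake_word)
    send(make_service_request(dialect))
    send(audio)
    return dialect
```

In a real device, `identify` would be one of the matching strategies of claim 2 and `send` would write to the communication component; here they are stand-in callables so the ordering of the two transmissions is visible.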
2. The method according to claim 1, characterized in that identifying the first dialect to which the voice wake-up word belongs comprises:
performing dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or
matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose degree of matching with the acoustic features of the voice wake-up word meets a second set requirement; or
converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
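The first branch of claim 2 (dynamic matching of acoustic features against per-dialect reference wake-up words) can be sketched with dynamic time warping (DTW) over frame-level features. The dialect names, the one-dimensional toy features, and the minimum-distance-under-threshold criterion below are assumptions for illustration; the claim only requires that the matching degree meet a set requirement, and a real system would use features such as MFCCs.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (frames x dims). Smaller means more similar."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def identify_dialect(wake_features, reference_features, threshold=None):
    """Return the dialect whose reference wake-up word best matches the
    wake-up word's features (the 'first set requirement' is modelled here
    as minimum DTW distance, optionally under a threshold)."""
    best_dialect, best_dist = None, float("inf")
    for dialect, ref in reference_features.items():
        dist = dtw_distance(wake_features, ref)
        if dist < best_dist:
            best_dialect, best_dist = dialect, dist
    if threshold is not None and best_dist > threshold:
        return None  # no reference wake-up word matched well enough
    return best_dialect
```

DTW is a natural fit here because the same wake-up word spoken in different dialects (or at different speeds) yields feature sequences of different lengths, which DTW aligns before comparing.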
3. The method according to claim 1, characterized in that receiving the voice wake-up word comprises:
in response to an instruction to start up or power on the terminal device, displaying a voice input interface to the user;
acquiring, via the voice input interface, the voice wake-up word input by the user.
4. The method according to any one of claims 1-3, characterized in that before sending the speech signal to be recognized to the server, the method further comprises:
outputting a voice input prompt, to prompt the user to perform voice input;
receiving the speech signal to be recognized input by the user.
5. The method according to claim 4, characterized in that before outputting the voice input prompt, the method further comprises:
receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
6. The method according to any one of claims 1-3, characterized in that before receiving the voice wake-up word, the method further comprises:
in response to a wake-up word customization operation, receiving a custom speech signal input by the user;
saving the custom speech signal as the voice wake-up word.
7. A speech recognition method, applicable to a server, characterized in that the method comprises:
receiving a service request sent by a terminal device, the service request indicating selection of the ASR model corresponding to a first dialect;
selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs;
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
8. The method according to claim 7, characterized in that before selecting, from among the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the method further comprises:
collecting corpora of different dialects;
performing feature extraction on the corpora of the different dialects, to obtain acoustic features of the different dialects;
constructing the ASR models corresponding to the different dialects according to the acoustic features of the different dialects.
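The collect-corpus, extract-features, build-model pipeline of claim 8 can be sketched as below. This is a heavily simplified stand-in: real per-dialect ASR models would be trained acoustic and language models (e.g. HMM- or neural-network-based), whereas here the "feature extraction" is toy summary statistics and the "model" is a feature centroid, chosen only to show the shape of the pipeline. All function names are illustrative.

```python
import numpy as np

def extract_features(utterance):
    """Toy feature extraction: a real system would compute MFCC or
    filterbank features over audio frames; here each 'utterance' is a
    plain number sequence summarized by mean and standard deviation."""
    arr = np.asarray(utterance, dtype=float)
    return np.array([arr.mean(), arr.std()])

def build_dialect_models(corpora):
    """Build one model per dialect from its corpus. The 'model' here is
    the centroid of per-utterance features, keyed by dialect name, so
    the server can later select it when a service request names that
    dialect."""
    return {
        dialect: np.mean([extract_features(u) for u in utterances], axis=0)
        for dialect, utterances in corpora.items()
    }
```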
9. A speech recognition method, applicable to a terminal device, characterized in that the method comprises:
receiving a voice wake-up word;
sending the voice wake-up word to a server, so that the server selects, based on the voice wake-up word and from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect to which the voice wake-up word belongs;
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
10. A speech recognition method, applicable to a server, characterized in that the method comprises:
receiving a voice wake-up word sent by a terminal device;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
11. A speech recognition method, characterized by comprising:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
performing speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect.
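The end-to-end method of claim 11 (identify the dialect from the wake-up word, select the matching ASR model, then recognize with it) can be sketched as a small dispatcher. The wake-word-to-dialect table and the per-dialect recognizer callables below are illustrative stand-ins, not real trained models.

```python
class SpeechRecognizer:
    """Minimal sketch of claim 11: identify the dialect from the
    wake-up word, select the matching ASR model, then perform
    speech recognition with the selected model."""

    def __init__(self, wake_word_dialects, asr_models):
        self.wake_word_dialects = wake_word_dialects  # wake word -> dialect
        self.asr_models = asr_models                  # dialect -> recognizer
        self.selected = None

    def on_wake_word(self, wake_word):
        """Identify the first dialect and select its ASR model."""
        dialect = self.wake_word_dialects.get(wake_word)
        if dialect is None:
            raise ValueError("unknown wake-up word")
        self.selected = self.asr_models[dialect]
        return dialect

    def recognize(self, signal):
        """Recognize a speech signal with the previously selected model."""
        if self.selected is None:
            raise RuntimeError("no ASR model selected yet")
        return self.selected(signal)
```

The same skeleton fits both deployment splits described in the claims: in claims 1-6 the dialect identification runs on the terminal and the recognizer lives on the server; in claims 9-10 both steps run on the server.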
12. A speech recognition method, applicable to a terminal device, characterized in that the method comprises:
receiving a voice wake-up word, to wake up a speech recognition function;
receiving a first speech signal with dialect-indicating meaning input by the user;
parsing, from the first speech signal, a first dialect for which speech recognition is required;
sending a service request to a server, to request the server to select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
13. The method according to claim 12, characterized in that before sending the service request to the server, the method further comprises:
if the first dialect cannot be parsed from the first speech signal, identifying the dialect to which the voice wake-up word belongs as the first dialect.
14. The method according to claim 12 or 13, characterized in that parsing, from the first speech signal, the first dialect for which speech recognition is required comprises:
converting the first speech signal into a first phoneme sequence based on an acoustic model;
matching the phoneme segments corresponding to the names of different dialects in the first phoneme sequence respectively;
when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
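The phoneme-segment matching of claim 14 reduces to searching the first phoneme sequence for the phoneme segment of each dialect name. The pinyin-style phone sets and the two dialect names below are illustrative assumptions; a real system would use the phone inventory of its acoustic model.

```python
# Illustrative pinyin-style phoneme segments for two dialect names;
# these are assumptions, not the patent's phone set.
DIALECT_NAME_PHONEMES = {
    "cantonese": ["g", "uang", "d", "ong", "h", "ua"],
    "sichuanese": ["s", "i", "ch", "uan", "h", "ua"],
}

def find_dialect(phoneme_sequence, name_phonemes=DIALECT_NAME_PHONEMES):
    """Scan the first phoneme sequence for each dialect name's phoneme
    segment; return the dialect whose segment occurs as a contiguous
    subsequence, or None if nothing matches (the fallback of claim 13
    would then use the wake-up word's dialect instead)."""
    for dialect, segment in name_phonemes.items():
        k = len(segment)
        for i in range(len(phoneme_sequence) - k + 1):
            if phoneme_sequence[i:i + k] == segment:
                return dialect
    return None
```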
15. A terminal device, characterized by comprising: a memory, a processor, and a communication component;
the memory, for storing a computer program;
the processor, coupled with the memory, for executing the computer program, so as to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
send a service request to a server through the communication component, to request the server to select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component, for receiving the voice wake-up word, and sending the service request and the speech signal to be recognized to the server.
16. A server, characterized by comprising: a memory, a processor, and a communication component;
the memory, for storing a computer program;
the processor, coupled with the memory, for executing the computer program, so as to:
receive, through the communication component, a service request sent by a terminal device, the service request indicating selection of the ASR model corresponding to a first dialect;
select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which a voice wake-up word belongs;
receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component, for receiving the service request and the speech signal to be recognized.
17. A terminal device, characterized by comprising: a memory, a processor, and a communication component;
the memory, for storing a computer program;
the processor, coupled with the memory, for executing the computer program, so as to:
receive a voice wake-up word through the communication component;
send the voice wake-up word to a server through the communication component, so that the server selects, based on the voice wake-up word and from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect to which the voice wake-up word belongs;
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component, for receiving the voice wake-up word, and sending the voice wake-up word and the speech signal to be recognized to the server.
18. A server, characterized by comprising: a memory, a processor, and a communication component;
the memory, for storing a computer program;
the processor, coupled with the memory, for executing the computer program, so as to:
receive, through the communication component, a voice wake-up word sent by a terminal device;
identify a first dialect to which the voice wake-up word belongs;
select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component, for receiving the voice wake-up word and the speech signal to be recognized.
19. An electronic device, characterized by comprising: a memory, a processor, and a communication component;
the memory, for storing a computer program;
the processor, coupled with the memory, for executing the computer program, so as to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component, for receiving the voice wake-up word.
20. A terminal device, characterized by comprising: a memory, a processor, and a communication component;
the memory, for storing a computer program;
the processor, coupled with the memory, for executing the computer program, so as to:
receive a voice wake-up word through the communication component, to wake up a speech recognition function;
receive, through the communication component, a first speech signal with dialect-indicating meaning input by the user;
parse, from the first speech signal, a first dialect for which speech recognition is required;
send a service request to a server through the communication component, to request the server to select, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component, for receiving the voice wake-up word and the first speech signal, and sending the service request and the speech signal to be recognized to the server.
21. A computer-readable storage medium storing a computer program, characterized in that, when executed by a computer, the computer program implements the steps of the method according to any one of claims 1-6.
22. A computer-readable storage medium storing a computer program, characterized in that, when executed by a computer, the computer program implements the steps of the method according to any one of claims 7-8.
23. A speech recognition system, characterized by comprising a server and a terminal device;
the terminal device, for receiving a voice wake-up word, identifying a first dialect to which the voice wake-up word belongs, sending a service request to the server, and sending a speech signal to be recognized to the server, the service request indicating selection of the ASR model corresponding to the first dialect;
the server, for receiving the service request, selecting, according to the indication of the service request and from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, receiving the speech signal to be recognized, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
24. A speech recognition system, characterized by comprising a server and a terminal device;
the terminal device, for receiving a voice wake-up word, sending the voice wake-up word to the server, and sending a speech signal to be recognized to the server;
the server, for receiving the voice wake-up word, identifying a first dialect to which the voice wake-up word belongs, selecting, from among ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, receiving the speech signal to be recognized, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711147698.XA CN109817220A (en) | 2017-11-17 | 2017-11-17 | Audio recognition method, apparatus and system |
TW107132609A TW201923736A (en) | 2017-11-17 | 2018-09-17 | Speech recognition method, device and system |
PCT/CN2018/114531 WO2019096056A1 (en) | 2017-11-17 | 2018-11-08 | Speech recognition method, device and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711147698.XA CN109817220A (en) | 2017-11-17 | 2017-11-17 | Audio recognition method, apparatus and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109817220A true CN109817220A (en) | 2019-05-28 |
Family
ID=66539363
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711147698.XA Pending CN109817220A (en) | 2017-11-17 | 2017-11-17 | Audio recognition method, apparatus and system |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN109817220A (en) |
TW (1) | TW201923736A (en) |
WO (1) | WO2019096056A1 (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110364147A (en) * | 2019-08-29 | 2019-10-22 | 厦门市思芯微科技有限公司 | A kind of wake-up training word acquisition system and method |
CN110827799A (en) * | 2019-11-21 | 2020-02-21 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and medium for processing voice signal |
CN110853643A (en) * | 2019-11-18 | 2020-02-28 | 北京小米移动软件有限公司 | Method, device, equipment and storage medium for voice recognition in fast application |
CN111081217A (en) * | 2019-12-03 | 2020-04-28 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111091809A (en) * | 2019-10-31 | 2020-05-01 | 国家计算机网络与信息安全管理中心 | Regional accent recognition method and device based on depth feature fusion |
CN111128125A (en) * | 2019-12-30 | 2020-05-08 | 深圳市优必选科技股份有限公司 | Voice service configuration system and voice service configuration method and device thereof |
CN111724766A (en) * | 2020-06-29 | 2020-09-29 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
CN112102819A (en) * | 2019-05-29 | 2020-12-18 | 南宁富桂精密工业有限公司 | Voice recognition device and method for switching recognition languages thereof |
CN112116909A (en) * | 2019-06-20 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Voice recognition method, device and system |
CN112820296A (en) * | 2021-01-06 | 2021-05-18 | 北京声智科技有限公司 | Data transmission method and electronic equipment |
CN113506565A (en) * | 2021-07-12 | 2021-10-15 | 北京捷通华声科技股份有限公司 | Speech recognition method, speech recognition device, computer-readable storage medium and processor |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN104575504A (en) * | 2014-12-24 | 2015-04-29 | 上海师范大学 | Method for personalized television voice wake-up by voiceprint and voice identification |
US9275637B1 (en) * | 2012-11-06 | 2016-03-01 | Amazon Technologies, Inc. | Wake word evaluation |
CN105654943A (en) * | 2015-10-26 | 2016-06-08 | 乐视致新电子科技(天津)有限公司 | Voice wakeup method, apparatus and system thereof |
CN106653031A (en) * | 2016-10-17 | 2017-05-10 | 海信集团有限公司 | Voice wake-up method and voice interaction device |
CN106997762A (en) * | 2017-03-08 | 2017-08-01 | 广东美的制冷设备有限公司 | The sound control method and device of household electrical appliance |
CN107134279A (en) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | A kind of voice awakening method, device, terminal and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9431012B2 (en) * | 2012-04-30 | 2016-08-30 | 2236008 Ontario Inc. | Post processing of natural language automatic speech recognition |
CN105223851A (en) * | 2015-10-09 | 2016-01-06 | 韩山师范学院 | Based on intelligent socket system and the control method of accent recognition |
CN105957527A (en) * | 2016-05-16 | 2016-09-21 | 珠海格力电器股份有限公司 | Method and device for voice control of electric appliance and voice control air conditioner |
CN106452997A (en) * | 2016-09-30 | 2017-02-22 | 无锡小天鹅股份有限公司 | Household electrical appliance and control system thereof |
- 2017-11-17: CN CN201711147698.XA patent/CN109817220A/en active Pending
- 2018-09-17: TW TW107132609A patent/TW201923736A/en unknown
- 2018-11-08: WO PCT/CN2018/114531 patent/WO2019096056A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9275637B1 (en) * | 2012-11-06 | 2016-03-01 | Amazon Technologies, Inc. | Wake word evaluation |
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
CN104575504A (en) * | 2014-12-24 | 2015-04-29 | 上海师范大学 | Method for personalized television voice wake-up by voiceprint and voice identification |
CN105654943A (en) * | 2015-10-26 | 2016-06-08 | 乐视致新电子科技(天津)有限公司 | Voice wakeup method, apparatus and system thereof |
CN106653031A (en) * | 2016-10-17 | 2017-05-10 | 海信集团有限公司 | Voice wake-up method and voice interaction device |
CN106997762A (en) * | 2017-03-08 | 2017-08-01 | 广东美的制冷设备有限公司 | The sound control method and device of household electrical appliance |
CN107134279A (en) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | A kind of voice awakening method, device, terminal and storage medium |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102819A (en) * | 2019-05-29 | 2020-12-18 | 南宁富桂精密工业有限公司 | Voice recognition device and method for switching recognition languages thereof |
CN112116909A (en) * | 2019-06-20 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Voice recognition method, device and system |
CN110364147A (en) * | 2019-08-29 | 2019-10-22 | 厦门市思芯微科技有限公司 | A kind of wake-up training word acquisition system and method |
CN111091809A (en) * | 2019-10-31 | 2020-05-01 | 国家计算机网络与信息安全管理中心 | Regional accent recognition method and device based on depth feature fusion |
CN110853643A (en) * | 2019-11-18 | 2020-02-28 | 北京小米移动软件有限公司 | Method, device, equipment and storage medium for voice recognition in fast application |
CN110827799A (en) * | 2019-11-21 | 2020-02-21 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and medium for processing voice signal |
CN111081217B (en) * | 2019-12-03 | 2021-06-04 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111081217A (en) * | 2019-12-03 | 2020-04-28 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111128125A (en) * | 2019-12-30 | 2020-05-08 | 深圳市优必选科技股份有限公司 | Voice service configuration system and voice service configuration method and device thereof |
CN111724766A (en) * | 2020-06-29 | 2020-09-29 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
CN111724766B (en) * | 2020-06-29 | 2024-01-05 | 合肥讯飞数码科技有限公司 | Language identification method, related equipment and readable storage medium |
CN112820296A (en) * | 2021-01-06 | 2021-05-18 | 北京声智科技有限公司 | Data transmission method and electronic equipment |
CN113506565A (en) * | 2021-07-12 | 2021-10-15 | 北京捷通华声科技股份有限公司 | Speech recognition method, speech recognition device, computer-readable storage medium and processor |
CN113506565B (en) * | 2021-07-12 | 2024-06-04 | 北京捷通华声科技股份有限公司 | Speech recognition method, device, computer readable storage medium and processor |
Also Published As
Publication number | Publication date |
---|---|
TW201923736A (en) | 2019-06-16 |
WO2019096056A1 (en) | 2019-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109817220A (en) | Audio recognition method, apparatus and system | |
CN103366740B (en) | Voice command identification method and device | |
US10977299B2 (en) | Systems and methods for consolidating recorded content | |
CN103095911B (en) | Method and system for finding mobile phone through voice awakening | |
CN102568478B (en) | Video play control method and system based on voice recognition | |
US10719115B2 (en) | Isolated word training and detection using generated phoneme concatenation models of audio inputs | |
US20170140750A1 (en) | Method and device for speech recognition | |
CN110706690A (en) | Speech recognition method and device | |
WO2017012511A1 (en) | Voice control method and device, and projector apparatus | |
CN110473546B (en) | Media file recommendation method and device | |
US20110093261A1 (en) | System and method for voice recognition | |
CN106971723A (en) | Method of speech processing and device, the device for speech processes | |
CN100521708C (en) | Voice recognition and voice tag recoding and regulating method of mobile information terminal | |
US20160372110A1 (en) | Adapting voice input processing based on voice input characteristics | |
CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
CN109994106B (en) | Voice processing method and equipment | |
CN101794576A (en) | Dirty word detection aid and using method thereof | |
CN110634466B (en) | TTS treatment technology with high infectivity | |
US10699706B1 (en) | Systems and methods for device communications | |
CN105206123B (en) | A kind of deaf and dumb patient's ac equipment | |
US20190066669A1 (en) | Graphical data selection and presentation of digital content | |
CN109272991A (en) | Method, apparatus, equipment and the computer readable storage medium of interactive voice | |
CN101825953A (en) | Chinese character input product with combined voice input and Chinese phonetic alphabet input functions | |
CN111862943B (en) | Speech recognition method and device, electronic equipment and storage medium | |
CN113611316A (en) | Man-machine interaction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190528 |
RJ01 | Rejection of invention patent application after publication |