WO2019096056A1 - Speech recognition method, device and system - Google Patents

Speech recognition method, device and system

Info

Publication number
WO2019096056A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialect
voice
wake
word
server
Prior art date
Application number
PCT/CN2018/114531
Other languages
French (fr)
Chinese (zh)
Inventor
牛也
徐巍越
冯伟国
黄光远
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2019096056A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present application relates to the field of voice recognition technologies, and in particular, to a voice recognition method, apparatus, and system.
  • ASR: Automatic Speech Recognition
  • aspects of the present application provide a speech recognition method, apparatus, and system for automatically performing speech recognition on a plurality of dialects and improving the efficiency of speech recognition for the plurality of dialects.
  • the embodiment of the present application provides a voice recognition method, which is applicable to a terminal device, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a server, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a terminal device, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a server, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, including:
  • the ASR model corresponding to the first dialect is used for speech recognition of the speech signal to be recognized.
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a terminal device, and the method includes:
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word, and send the service request and the to-be-identified voice signal to the server.
  • the embodiment of the present application further provides a server, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the service request and the to-be-identified voice signal.
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • sending, by the communication component, the voice wake-up word to the server, so that the server selects, based on the voice wake-up word, an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects;
  • the communication component is configured to receive the voice wake-up word, and send the voice wake-up word and the to-be-identified voice signal to the server.
  • the embodiment of the present application further provides a server, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word and the to-be-identified voice signal.
  • the embodiment of the present application further provides an electronic device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word.
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word and the first voice signal, and send the service request and the to-be-identified voice signal to the server.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program; when the computer program is executed by a computer, the steps in the foregoing first voice recognition method embodiment can be implemented.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program; when the computer program is executed by a computer, the steps in the foregoing second voice recognition method embodiment can be implemented.
  • the embodiment of the present application further provides a voice recognition system, including a server and a terminal device;
  • the terminal device is configured to receive a voice wake-up word, identify a first dialect to which the voice wake-up word belongs, send a service request to the server, and send a to-be-identified voice signal to the server, where the service request indicates selection of the ASR model corresponding to the first dialect;
  • the server is configured to receive the service request, select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects according to the indication of the service request, receive the to-be-identified voice signal, and perform speech recognition on the to-be-identified voice signal by using the ASR model corresponding to the first dialect.
  • the embodiment of the present application further provides a voice recognition system, including a server and a terminal device;
  • the terminal device is configured to receive a voice wake-up word, send the voice wake-up word to the server, and send the to-be-identified voice signal to the server;
  • the server is configured to receive the voice wake-up word, identify a first dialect to which the voice wake-up word belongs, select an ASR model corresponding to the first dialect from an ASR model corresponding to different dialects, and receive the waiting Identifying a voice signal, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  • the ASR model is constructed for different dialects, and in the speech recognition process, the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent speech signals to be recognized, so that multi-dialect speech recognition is automated: the ASR model of the corresponding dialect is automatically selected based on the voice wake-up word, which is more convenient and faster to implement because no manual operation by the user is required, and is beneficial to improving the efficiency of multi-dialect speech recognition.
  • FIG. 1 is a schematic structural diagram of a voice recognition system according to an exemplary embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice recognition method according to another exemplary embodiment of the present application.
  • FIG. 3 is a schematic flowchart diagram of another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of another voice recognition system according to still another exemplary embodiment of the present application.
  • FIG. 5 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 6 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 7 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a module of a voice recognition apparatus according to another exemplary embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a terminal device according to another exemplary embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another voice recognition apparatus according to another exemplary embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of a server according to still another exemplary embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of still another module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of still another terminal device according to another exemplary embodiment of the present disclosure.
  • FIG. 14 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of another server according to another exemplary embodiment of the present disclosure.
  • FIG. 16 is a schematic structural diagram of still another module of a voice recognition apparatus according to still another exemplary embodiment of the present disclosure.
  • FIG. 17 is a schematic structural diagram of an electronic device according to still another exemplary embodiment of the present application.
  • at present, speech recognition schemes for dialects are not mature.
  • the embodiment of the present application provides a solution, the main idea of which is to construct an ASR model for each of the different dialects. In the process of speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized speech signal, thereby automating multi-dialect speech recognition.
  • the ASR model of the corresponding dialect is automatically selected based on the voice wake-up word, which is more convenient and faster to implement because no manual operation by the user is required, and is beneficial to improving the efficiency of multi-dialect speech recognition.
  • FIG. 1 is a schematic structural diagram of a voice recognition system according to an exemplary embodiment of the present application.
  • the speech recognition system 100 includes a server 101 and a terminal device 102.
  • a communication connection is made between the server 101 and the terminal device 102.
  • the terminal device 102 can communicate with the server 101 via the Internet, or via a mobile network. If the terminal device 102 is in communication connection with the server 101 through a mobile network, the network standard of the mobile network may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, and the like.
  • the server 101 mainly provides an ASR model for different dialects, and selects a corresponding ASR model to perform speech recognition on the speech signals in the corresponding dialects.
  • the server 101 can be any device that can provide computing services, can respond to service requests, and process, such as a conventional server, a cloud server, a cloud host, a virtual center, and the like.
  • the composition of the server mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general-purpose computer architecture.
  • the terminal device 102 is mainly oriented to the user, and may provide an interface or portal for voice recognition to the user.
  • the terminal device 102 can be implemented in various forms, such as a smart phone, a smart speaker, a personal computer, a wearable device, a tablet computer, and the like.
  • the terminal device 102 typically includes at least one processing unit and at least one memory. The number of processing units and memories depends on the configuration and type of terminal device 102.
  • the memory may include volatile memory, such as RAM, and may also include non-volatile memory, such as read-only memory (ROM) or flash memory, or both.
  • An operating system (OS), one or more applications, and program data are stored in the memory.
  • the terminal device 102 also includes some basic configurations, such as a network card chip, an IO bus, an audio and video component (such as a microphone), and the like.
  • the terminal device 102 may also include some peripheral devices such as a keyboard, a mouse, a stylus, a printer, and the like. These peripheral devices are well known in the art and will not be described herein.
  • the terminal device 102 and the server 101 cooperate with each other to provide a voice recognition function to the user.
  • the terminal device 102 may be used by multiple users, and multiple users may hold different dialects.
  • by geographical division, dialects can include the following types: Mandarin dialects, Jin dialect, Xiang dialect, Gan dialect, Wu dialect, Min dialect, Cantonese, and Hakka.
  • some dialects can also be subdivided.
  • for example, the Min dialects can include the Minbei (Northern Min) dialect, Minnan (Southern Min) dialect, Mindong (Eastern Min) dialect, Minzhong (Central Min) dialect, and Puxian dialect. The pronunciations of different dialects differ considerably, and the same ASR model cannot be used for speech recognition across all of them.
  • the ASR models are separately constructed for different dialects in order to perform speech recognition on different dialects. Further, based on the cooperation between the terminal device 102 and the server 101, a voice recognition function can be provided to users holding different dialects, that is, voice recognition can be performed on voice signals of users holding different dialects.
  • the terminal device 102 supports the voice wake-up word function, that is, when the user wants to perform voice recognition, the voice wake-up word can be input to the terminal device 102 to wake up the voice recognition function.
  • the voice wake-up word is a voice signal with specified text content, and may be, for example, "on", "Tmall Genie", "hello", and the like.
  • the terminal device 102 receives the voice wake-up word input by the user, identifies the dialect to which the voice wake-up word belongs, and thereby determines the dialect to which the subsequent voice signal to be recognized belongs (i.e., the dialect to which the voice wake-up word belongs), so that the ASR model corresponding to that dialect can be used.
  • for ease of description, the dialect to which the voice wake-up word belongs is denoted as the first dialect.
  • the first dialect to which the voice wake-up word belongs may be any dialect in any language.
  • the terminal device 102 may send a service request to the server 101, the service request instructing the server 101 to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the server 101 receives the service request sent by the terminal device 102, and then selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects according to the indication of the service request, so that the subsequent to-be-recognized speech signal can be recognized based on the ASR model corresponding to the first dialect.
  • the server 101 stores in advance an ASR model corresponding to different dialects.
  • the ASR model is a model that converts speech signals into text.
  • one dialect may correspond to one ASR model, or several similar dialects may correspond to the same ASR model; this is not limited here.
  • the ASR model corresponding to the first dialect is used to convert the voice signal of the first dialect into text content.
  • After transmitting the service request to the server 101, the terminal device 102 continues to send the to-be-identified voice signal, which belongs to the first dialect, to the server 101.
  • the server 101 receives the to-be-recognized speech signal sent by the terminal device 102 and performs speech recognition on it according to the selected ASR model corresponding to the first dialect. Not only can speech in the first dialect thus be recognized, but adopting the matching ASR model also helps improve the accuracy of speech recognition.
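The server-side flow described above (receive a service request naming the first dialect, pick that dialect's ASR model, run recognition) can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the `ASRModel` class, `handle_service_request` function, and the dialect names are all hypothetical placeholders.

```python
class ASRModel:
    """Hypothetical stand-in for a dialect-specific speech-to-text model."""
    def __init__(self, dialect):
        self.dialect = dialect

    def recognize(self, speech_signal):
        # A real model would decode audio; here we just tag the result.
        return f"[{self.dialect} transcription of {speech_signal!r}]"

# One ASR model per dialect (several similar dialects could share a model).
ASR_MODELS = {d: ASRModel(d) for d in ("Mandarin", "Cantonese", "Hakka", "Min")}

def handle_service_request(first_dialect, speech_signal):
    """Select the ASR model the service request indicates, then perform
    speech recognition on the to-be-recognized speech signal."""
    model = ASR_MODELS.get(first_dialect)
    if model is None:
        raise ValueError(f"no ASR model for dialect {first_dialect!r}")
    return model.recognize(speech_signal)

print(handle_service_request("Cantonese", "play a song"))
```

The key design point mirrored here is that model selection happens once, up front, from the dialect identified via the wake-up word; every subsequent signal in the session reuses the selected model.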
  • the to-be-identified voice signal may be a voice signal that the user continues to input to the terminal device 102 after inputting the voice wake-up word. Based on this, before transmitting the to-be-identified voice signal to the server 101, the terminal device 102 may further receive the voice signal input by the user.
  • the speech signal to be recognized may also be a voice signal pre-recorded and stored locally on the terminal device 102, in which case the terminal device 102 may acquire the voice signal to be recognized directly from local storage.
  • the server 101 may return the associated information of the speech recognition result or the speech recognition result to the terminal device 102.
  • the server 101 may return the text content recognized from the voice to the terminal device 102; or the server 101 may return information such as songs and videos that match the voice recognition result to the terminal device 102.
  • the terminal device 102 receives the speech recognition result returned by the server 101 or the association information of the speech recognition result, and performs subsequent processing based on the speech recognition result or the association information of the speech recognition result.
  • the terminal device 102 may present the text content to the user, or may perform a network search or the like based on the text content.
  • the terminal device 102 can play information such as songs and videos, or can forward such information to other users for information sharing.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent speech signals to be recognized, so that multi-dialect speech recognition is automated: the ASR model of the corresponding dialect is automatically selected based on the voice wake-up word, which is more convenient and quicker to implement because no manual operation by the user is required, and is conducive to improving the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs takes relatively little time, so the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
  • the manner in which the terminal device 102 recognizes the first dialect to which the voice wake-up word belongs is not limited, and any manner in which the first dialect to which the voice wake-up word belongs can be applied to the embodiments of the present application.
  • several ways in which the terminal device 102 may recognize the dialect to which the voice wake-up word belongs are listed below:
  • Mode 1: the terminal device 102 dynamically matches the voice wake-up word against the reference wake-up words recorded in different dialects, and takes as the first dialect the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets the first setting requirement.
  • the reference wake-up words are recorded in advance in different dialects.
  • the reference wake-up words recorded in different dialects have the same text content as the voice wake-up word. Because users of different dialects have different vocalization mechanisms, the acoustic characteristics of the reference wake-up words recorded in different dialects differ.
  • the terminal device 102 pre-records the reference wake-up words in different dialects. After receiving the voice wake-up word input by the user, it dynamically matches the voice wake-up word against the reference wake-up words recorded in the different dialects to obtain the degree of matching with each reference wake-up word.
  • the first setting requirement may be different according to different application scenarios.
  • for example, the dialect corresponding to the reference wake-up word with the highest degree of matching with the voice wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect corresponding to a reference wake-up word whose matching degree with the voice wake-up word is greater than the threshold is taken as the first dialect; or a matching-degree range may be set, and the dialect corresponding to a reference wake-up word whose matching degree with the voice wake-up word falls within that range is taken as the first dialect.
  • the acoustic features may be embodied as time domain features and frequency domain features of the speech signal.
  • dynamic matching of the voice wake-up word can be performed based on the dynamic time warping (DTW) method.
  • dynamic time warping is a method of measuring the similarity between two time series.
  • the terminal device 102 generates a time series for the voice wake-up word from the input signal and compares it with the time series of the reference wake-up words recorded in different dialects; at least one pair of similar points is determined between the two time series being compared.
  • the similarity between two time series is measured by the sum of the distances between these similar points, that is, the warping path distance.
  • for example, the dialect corresponding to the reference wake-up word with the smallest warping path distance from the voice wake-up word may be taken as the first dialect; or a distance threshold may be set, and the dialect corresponding to a reference wake-up word whose warping path distance from the voice wake-up word is less than the threshold is taken as the first dialect; or a distance range may be set, and the dialect corresponding to a reference wake-up word whose warping path distance from the voice wake-up word falls within that range is taken as the first dialect.
  • Mode 2: the terminal device 102 recognizes the acoustic features of the voice wake-up word, matches them against the acoustic features of different dialects, and takes as the first dialect the dialect whose acoustic features match those of the voice wake-up word to a degree that meets the second setting requirement.
  • in this mode, the acoustic features of different dialects are acquired in advance, the acoustic features of the voice wake-up word are recognized, and the first dialect to which the voice wake-up word belongs is then determined based on the matching between these acoustic features.
  • the speech wake words may be filtered and digitized prior to identifying the acoustic features of the speech wake words.
  • the filtering process refers to preserving the components of the voice wake-up word signal whose frequencies lie between 300 and 3400 Hz.
  • digitization refers to A/D conversion and anti-aliasing processing of the preserved signal.
  • the acoustic features of the voice wake-up word may be identified by calculating spectral feature parameters of the voice wake-up word, such as sliding differential cepstral parameters. Similar to mode 1, the second setting requirement may differ depending on the application scenario. For example, the dialect corresponding to the reference wake-up word whose acoustic features have the highest degree of matching with those of the voice wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect corresponding to a reference wake-up word whose acoustic-feature matching degree with the voice wake-up word is greater than the threshold is taken as the first dialect; or a matching-degree range may be set, and the dialect corresponding to a reference wake-up word whose acoustic-feature matching degree with the voice wake-up word falls within that range is taken as the first dialect.
  • the sliding differential cepstrum parameter is composed of several blocks of differential cepstra spanning multiple frames of speech; because it takes the differential cepstra of preceding and following frames into account, it incorporates more timing information. The sliding differential cepstral parameters of the voice wake-up word are compared with those of the reference wake-up words recorded in different dialects. Optionally, the dialect corresponding to the reference wake-up word whose sliding differential cepstral parameters match those of the voice wake-up word most closely is taken as the first dialect; or a parameter-difference threshold may be set, and the dialect corresponding to a reference wake-up word whose parameter difference from the voice wake-up word is less than the threshold is taken as the first dialect; or a parameter-difference range may be set, and the dialect corresponding to a reference wake-up word whose parameter difference from the voice wake-up word falls within that range is taken as the first dialect.
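A heavily simplified sketch of mode 2's comparison: `delta_features` below computes plain first-order frame differentials as a stand-in for the sliding differential cepstrum (which stacks several such differentials across frames), and the smallest parameter difference selects the dialect. The two-dimensional "cepstral" vectors and dialect names are invented for illustration, not taken from the patent.

```python
import math

def delta_features(frames, d=1):
    """First-order frame-to-frame differentials: a simplified stand-in for
    sliding differential cepstral (SDC) parameters."""
    return [
        [frames[t + d][k] - frames[t - d][k] for k in range(len(frames[t]))]
        for t in range(d, len(frames) - d)
    ]

def feature_distance(x, y):
    """Euclidean distance between two equal-length delta-feature sequences."""
    return math.sqrt(sum(
        (a - b) ** 2 for fx, fy in zip(x, y) for a, b in zip(fx, fy)
    ))

# Toy per-frame "cepstral" vectors for each dialect's reference wake-up word.
refs = {
    "dialect_a": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    "dialect_b": [[3.0, 3.0], [2.0, 2.0], [1.0, 1.0], [0.0, 0.0]],
}
wake_word = [[0.1, 0.0], [1.0, 1.1], [2.0, 2.0], [2.9, 3.0]]  # rising trajectory

wake_delta = delta_features(wake_word)
# Smallest parameter difference wins (one of the selection rules above).
first_dialect = min(
    refs, key=lambda d: feature_distance(wake_delta, delta_features(refs[d]))
)
print(first_dialect)
```

The differential features capture how the spectrum changes over time, which is the timing information the SDC description above refers to.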
  • Mode 3: the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect corresponding to the reference text wake-up word whose matching degree meets the third setting requirement is taken as the first dialect.
  • the text wake-up word is the text obtained by performing speech recognition on the voice wake-up word.
  • the reference text wake-up words corresponding to different dialects are the texts obtained by converting the reference wake-up speech of the corresponding dialects.
  • the same speech recognition model may be used for this rough speech recognition across dialects, to improve the efficiency of the entire speech recognition process.
  • the ASR models corresponding to different dialects may be used in advance to perform voice recognition on the reference wake-up words of the corresponding dialects and convert them into the corresponding reference text wake-up words. After the voice wake-up word is received, the ASR model corresponding to each dialect may be selected in turn to convert the voice wake-up word into a text wake-up word, which is matched against the reference text wake-up words; the dialect whose reference text wake-up word matches is taken as the first dialect to which the voice wake-up word belongs.
  • for example, the dialect corresponding to the reference text wake-up word with the highest degree of matching with the text wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect corresponding to a reference text wake-up word whose matching degree with the text wake-up word is greater than the threshold is taken as the first dialect; or a matching-degree range may be set, and the dialect corresponding to a reference text wake-up word whose matching degree with the text wake-up word falls within that range is taken as the first dialect.
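Mode 3's text-level matching might be sketched as follows. The romanized wake-word strings and dialect names are invented for illustration, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever text-matching measure an implementation would actually use; the "highest matching degree" rule is applied.

```python
import difflib

# Reference text wake-up words per dialect: the text each dialect's recorded
# reference wake-up word decodes to. These strings are illustrative only.
reference_texts = {
    "dialect_a": "ni hao tian mao",
    "dialect_b": "nei hou tin maau",
}

def identify_dialect(text_wake_word):
    """Take as the first dialect the dialect whose reference text wake-up
    word has the highest similarity to the decoded text wake-up word."""
    return max(
        reference_texts,
        key=lambda d: difflib.SequenceMatcher(
            None, text_wake_word, reference_texts[d]
        ).ratio(),
    )

print(identify_dialect("ni hao tian mao"))
```

Because a coarse recognizer transcribes dialect-specific pronunciations into systematically different text, even an approximate string similarity can separate the dialects.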
  • the first setting requirement, the second setting requirement, and the third setting requirement may be the same as or different from one another.
  • if the terminal device 102 is a device with a display screen, such as a mobile phone, a computer, or a wearable device, it may display a voice input interface on the display screen and acquire the voice signal input by the user through the voice input interface.
  • an instruction to activate or turn on the terminal device 102 may be sent by pressing a power button of the terminal device 102 or by touching its display screen.
  • the terminal device 102 can present a voice input interface to the user on the display in response to an instruction to activate or turn on itself.
  • an icon of the microphone or text information like “Wake Up Word Input” may be displayed on the voice input interface to instruct the user to input the voice wake up word.
  • the terminal device 102 can acquire a voice wake-up word input by the user based on the voice input interface.
  • the terminal device 102 may be a device with a voice playback function, such as a mobile phone, a computer, or a smart speaker. Based on this, after the terminal device 102 sends the service request to the server 101 and before it transmits the to-be-recognized voice signal, voice input prompt information, such as "please speak" or "please order", may be played to prompt the user to make a voice input. For the user, after the voice wake-up word has been input, the voice signal to be recognized may be input to the terminal device 102 at the prompt of the voice input prompt tone.
  • the terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101.
  • the server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
• In other embodiments, the terminal device 102 may be a device having a display screen, such as a mobile phone, a computer, or a wearable device. Based on this, after the terminal device 102 sends the service request to the server 101 and before it transmits the to-be-recognized voice signal to the server 101, it may display voice input prompt information in the form of text or an icon, such as the text "speak" or a microphone icon, to prompt the user to perform voice input. For the user, after the voice wake-up word is input, the to-be-recognized voice signal may be input to the terminal device 102 at the prompt of the voice input prompt information.
  • the terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101.
  • the server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
  • the terminal device 102 may have an indicator light. Based on this, after the terminal device 102 transmits the service request to the server 101, and before transmitting the voice signal to be recognized to the server 101, the indicator light can be illuminated to prompt the user to perform voice input. For the user, after inputting the voice wake-up word, the voice signal to be recognized may be input to the terminal device 102 at the prompt of the indicator light.
  • the terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101.
  • the server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
• The terminal device 102 may simultaneously have two or all three of a voice playback function, an indicator light, and a display screen. Based on this, the terminal device 102 can output the voice input prompt information in two or all three of the following manners at the same time: audio, text or icon, and lighting the indicator light, thereby enhancing the interaction with the user.
• In some embodiments, the terminal device 102 may learn in advance that the server 101 has selected the ASR model corresponding to the first dialect, so that after the to-be-recognized voice signal input by the user is transmitted to the server 101, the server 101 can directly recognize it with the selected ASR model. Based on this, after selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 101 returns a notification message to the terminal device 102, where the notification message indicates that the ASR model corresponding to the first dialect has been selected.
• Accordingly, the terminal device 102 receives the notification message returned by the server 101 and learns from it that the server 101 has selected the ASR model corresponding to the first dialect. After receiving the notification message, the terminal device 102 may output a voice input prompt tone, output voice input prompt information, or light the indicator light to prompt the user to perform voice input.
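The service-request/notification exchange described above can be sketched as follows; the message types, field names, and dialect names are illustrative placeholders, not part of the patent:

```python
def server_handle_service_request(request, asr_models):
    """Select the ASR model for the requested dialect and acknowledge."""
    dialect = request["dialect"]
    if dialect not in asr_models:
        return {"type": "error", "reason": "unsupported dialect"}
    # The selected model would be held for the upcoming recognition session.
    return {"type": "notification", "selected_dialect": dialect}

def terminal_on_message(message):
    """Prompt the user only after the server confirms model selection."""
    if message.get("type") == "notification":
        return "please speak"  # or light the indicator, or show an icon
    return None

asr_models = {"cantonese": object(), "mandarin": object()}
reply = server_handle_service_request({"dialect": "cantonese"}, asr_models)
prompt = terminal_on_message(reply)
```

The key ordering property is that `terminal_on_message` emits no prompt until the server's notification arrives, matching the "send request, receive notification, then prompt" flow in the text.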
• Before selecting the ASR model corresponding to the first dialect, the server 101 needs to construct the ASR models corresponding to different dialects.
• The process by which the server 101 constructs the ASR models corresponding to different dialects mainly includes: collecting corpora of different dialects; performing feature extraction on the corpora of different dialects to obtain acoustic features of the different dialects; and constructing the ASR models corresponding to the different dialects according to those acoustic features.
• The corpora of different dialects may be collected over the network, or voice recordings of a large number of users who speak different dialects may be made to obtain the corpora of the different dialects.
  • the collected corpus of different dialects may be pre-processed before feature extraction of corpora of different dialects.
  • the preprocessing process includes pre-emphasis processing, windowing processing, and endpoint detection processing on the voice.
  • feature extraction can be performed on the speech.
  • Features of speech include time domain features and frequency domain features.
  • the time domain features include short-term average energy, short-term average zero-crossing rate, formant, pitch period, etc.
• the frequency domain features include linear prediction coefficients, LPC cepstral coefficients, line spectrum pair parameters, the short-term spectrum, Mel-frequency cepstral coefficients, etc.
  • the process of extracting the acoustic features will be described by taking the Mel frequency cepstrum coefficient as an example.
• Several band-pass filters are set within the spectral range of the speech, each having a triangular or sinusoidal filtering characteristic; the feature vector is then obtained from the outputs of filtering the corpus through these band-pass filters. Because the energy information of the speech is contained in these outputs, the signal energy of each band-pass filter is calculated, and the Mel-frequency cepstral coefficients are then obtained from these energies by a discrete cosine transform.
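The final step above can be sketched in a few lines: given the energies of N band-pass filters, the cepstral coefficients are the DCT-II of the log filterbank energies. The filterbank energies below are made-up toy values, not taken from a real recording:

```python
import math

def mfcc_from_filterbank(energies, num_coeffs):
    """DCT-II of the log filterbank energies -> cepstral coefficients."""
    n = len(energies)
    log_e = [math.log(e) for e in energies]
    coeffs = []
    for k in range(num_coeffs):
        c = sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / n)
                for m in range(n))
        coeffs.append(c)
    return coeffs

filter_energies = [12.0, 9.5, 7.2, 4.8, 3.1, 2.0, 1.4, 1.1]  # toy values
mfcc = mfcc_from_filterbank(filter_energies, num_coeffs=4)
```

In a full front end the energies would come from triangular mel-spaced filters applied to the power spectrum of each frame; only the log-plus-DCT stage is shown here.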
• When constructing the ASR models, the acoustic features of the different dialects are used as input and the text corresponding to the corpora of the different dialects is used as output, and the parameters of the initial models corresponding to the different dialects are trained to obtain the ASR models corresponding to the different dialects.
  • the ASR model includes, but is not limited to, a model constructed based on vector quantization, a neural network model, and the like.
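As a toy illustration of the vector-quantization family mentioned above, the sketch below represents each wake-up word class by a codebook centroid of its training feature vectors and classifies by nearest centroid; the two-dimensional "acoustic features" and labels are invented example data:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def train_vq(labeled_features):
    """labeled_features: {label: [feature vectors]} -> {label: centroid}."""
    return {label: centroid(vecs) for label, vecs in labeled_features.items()}

def classify(codebooks, feature):
    """Return the label whose centroid is nearest (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebooks, key=lambda label: dist(codebooks[label], feature))

codebooks = train_vq({
    "ni hao": [[1.0, 0.2], [1.2, 0.1]],  # toy features of one wake-up word
    "hello":  [[0.1, 1.1], [0.0, 0.9]],
})
result = classify(codebooks, [1.1, 0.0])  # nearest to the "ni hao" centroid
```

A real vector-quantization recognizer would use many centroids per class over frame-level acoustic features; this single-centroid version only shows the codebook-and-nearest-match structure.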
• The terminal device with the song-on-demand function may be a smart speaker.
  • the smart speaker has a display screen, and the preset voice wake-up word of the smart speaker is “hello”.
• The Cantonese-speaking user first touches the display screen to input an instruction to activate the smart speaker; in response to this instruction, the smart speaker displays a voice input interface on the display screen, and the text "hello" is shown on the voice input interface.
  • the Cantonese user inputs a "hello” voice signal to the voice input interface.
• The smart speaker acquires the "hello" voice signal input by the user through the voice input interface and recognizes that "hello" belongs to the Cantonese dialect; it then sends a service request to the server to request that the server select the ASR model corresponding to the Cantonese dialect from the ASR models corresponding to different dialects. After receiving the service request, the server selects the ASR model corresponding to the Cantonese dialect and returns a notification message to the smart speaker indicating that the ASR model corresponding to the Cantonese dialect has been selected. The smart speaker then outputs a voice input prompt message, such as "please input voice", to prompt the user to input voice.
  • the Cantonese user enters the voice signal of the song name “Five Star Red Flag” at the prompt of the voice input prompt message.
• The smart speaker receives the voice signal "Five Star Red Flag" input by the Cantonese user and sends it to the server.
• The server uses the ASR model corresponding to the Cantonese dialect to perform speech recognition on the voice signal "Five Star Red Flag", obtains the text information "Five Star Red Flag", and delivers the song matching "Five Star Red Flag" to the smart speaker for the smart speaker to play.
• Taking another user as an example, the user can input the "hello" voice signal on the voice input interface displayed by the smart speaker.
• The smart speaker recognizes that this "hello" belongs to the Vietnamese dialect; it then sends a service request to the server to request that the server select the ASR model corresponding to the Vietnamese dialect from the ASR models corresponding to different dialects.
  • the server selects the ASR model corresponding to the Vietnamese dialect, and returns a notification message to the smart speaker, the notification message is used to indicate that the ASR model corresponding to the Vietnamese dialect has been selected.
• The smart speaker then outputs a voice input prompt message, such as "please input voice", to prompt the user to input voice.
  • the Vietnamese user enters the voice signal of the song name "My Country” at the prompt of the voice input prompt message.
  • the smart speaker receives the voice signal "My Country” input by the user and sends the voice signal "My Country” to the server.
• The server uses the ASR model corresponding to the Vietnamese dialect to perform speech recognition on the voice signal "My Country", obtains the text information "My Country", and delivers the song matching "My Country" to the smart speaker for the smart speaker to play.
• With the voice recognition method provided by the embodiment of the present application, the user does not need to manually switch the ASR model and only needs to input the voice wake-up word in the corresponding dialect. The smart speaker can automatically recognize the dialect to which the voice wake-up word belongs and then request the server to select the ASR model corresponding to that dialect, supporting song-on-demand in multiple dialects while improving the efficiency of ordering songs.
  • FIG. 2 is a schematic flowchart diagram of a voice recognition method according to another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 1, mainly from the perspective of the terminal device. As shown in Figure 2, the method includes:
  • the voice signal to be identified is sent to the server, so that the server uses the ASR model corresponding to the first dialect to perform voice recognition on the voice signal to be recognized.
  • a voice wake-up word may be input to the terminal device, and the voice wake-up word is a voice signal specifying the text content, such as "on”, “Tmall Elf", “hello”, and the like.
• The terminal device receives the voice wake-up word input by the user and identifies the dialect to which the voice wake-up word belongs, thereby determining the dialect to which the subsequent to-be-recognized voice signal belongs (i.e., the dialect to which the voice wake-up word belongs), which provides the basis for performing speech recognition with the ASR model corresponding to that dialect. For convenience of description and distinction, the dialect to which the voice wake-up word belongs is recorded as the first dialect.
  • the terminal device sends a service request to the server, and the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. Then, the terminal device transmits the to-be-identified voice signal to the server. After receiving the service request, the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, and identifies the received voice signal to be recognized through the selected ASR model corresponding to the first dialect.
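The server-side selection step can be sketched as keeping one ASR model per dialect and binding the one named in the service request for the session; the class name, dialect names, and stand-in "models" below are hypothetical:

```python
class AsrServer:
    def __init__(self, asr_models):
        self.asr_models = asr_models  # {dialect: model}, one per dialect
        self.selected = None

    def handle_service_request(self, first_dialect):
        # Select the ASR model corresponding to the first dialect.
        self.selected = self.asr_models[first_dialect]

    def recognize(self, voice_signal):
        # A real model would decode audio; here it is a stand-in callable.
        return self.selected(voice_signal)

server = AsrServer({
    "cantonese": lambda signal: "text decoded with the Cantonese model",
    "mandarin":  lambda signal: "text decoded with the Mandarin model",
})
server.handle_service_request("cantonese")
text = server.recognize(b"...audio bytes...")
```

The point of the split is that the (cheap) dialect decision happens once on the wake-up word, and all subsequent to-be-recognized signals reuse the already-selected model.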
• In this embodiment, the terminal device identifies the first dialect to which the voice wake-up word belongs and sends a service request to the server, so that the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects and performs speech recognition on the subsequent to-be-recognized voice signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing the to-be-recognized speech.
• One manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up word with reference wake-up words recorded in different dialects, and using, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets the first setting requirement.
• Another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word with the acoustic features of different dialects respectively, and using, as the first dialect, the dialect whose acoustic-feature match with the voice wake-up word meets the second setting requirement.
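One hypothetical way to realize acoustic-feature matching is to describe each dialect by a representative feature vector (e.g. averaged MFCCs) and take an inverse Euclidean distance as the matching degree; all vectors and the "second setting requirement" value below are invented toy values:

```python
import math

DIALECT_FEATURES = {
    "cantonese": [0.9, 0.1, 0.4],
    "mandarin":  [0.2, 0.8, 0.5],
}

def matching_degree(features_a, features_b):
    # Larger value = more similar; 1.0 means identical vectors.
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))
    return 1.0 / (1.0 + distance)

def first_dialect_from_features(wake_word_features, second_setting=0.5):
    # Best-matching dialect, accepted only if it meets the setting requirement.
    best = max(DIALECT_FEATURES,
               key=lambda d: matching_degree(wake_word_features, DIALECT_FEATURES[d]))
    degree = matching_degree(wake_word_features, DIALECT_FEATURES[best])
    return best if degree >= second_setting else None

dialect = first_dialect_from_features([0.85, 0.15, 0.45])
```

Returning `None` when no dialect meets the requirement leaves room for a fallback (e.g. a default dialect or a re-prompt), which the patent text does not specify.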
• Still another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word with the reference text wake-up words corresponding to different dialects respectively, and using, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets the third setting requirement.
  • one manner of receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
• In an optional embodiment, before transmitting the to-be-recognized voice signal to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the to-be-recognized voice signal input by the user.
• In an optional embodiment, before outputting the voice input prompt information, the method further includes: receiving a notification message returned by the server, the notification message being used to indicate that the ASR model corresponding to the first dialect has been selected.
  • FIG. 3 is a schematic flowchart diagram of another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 1, mainly from the perspective of the server. As shown in FIG. 3, the method includes:
• From the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect is selected, where the first dialect is the dialect to which the voice wake-up word belongs.
• After identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server.
• The server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects and then performs speech recognition on the subsequent voice signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
  • the server needs to construct an ASR model corresponding to different dialects before selecting the ASR model corresponding to the first dialect.
• The process of constructing the ASR models corresponding to different dialects mainly includes: collecting corpora of different dialects; performing feature extraction on the corpora of different dialects to obtain acoustic features of the different dialects; and constructing the ASR models corresponding to the different dialects according to those acoustic features.
• After the speech recognition, the speech recognition result or information associated with the speech recognition result may be transmitted to the terminal device, so that the terminal device can perform subsequent processing based on the speech recognition result or the associated information.
  • FIG. 4 is a schematic structural diagram of another voice recognition system according to still another exemplary embodiment of the present application.
  • the speech recognition system 400 includes a server 401 and a terminal device 402. A communication connection is made between the server 401 and the terminal device 402.
  • the architecture of the speech recognition system 400 provided in this embodiment is the same as that of the speech recognition system 100 shown in FIG. 1, except that the functions of the server 401 and the terminal device 402 in the speech recognition process are different.
• For the implementation forms of the terminal device 402 and the server 401 in FIG. 4 and the manner of their communication connection, refer to the description of the embodiment shown in FIG. 1; details are not described herein again.
• The terminal device 402 and the server 401 cooperate with each other and can likewise provide the speech recognition function to the user.
• The terminal device 402 may be used by multiple users who speak different dialects. Therefore, in the voice recognition system 400, ASR models are constructed separately for different dialects, and based on the cooperation between the terminal device 402 and the server 401, the speech recognition function can be provided to users speaking different dialects; that is, speech recognition can be performed on the voice signals of users speaking different dialects.
• The terminal device 402 also supports the voice wake-up word function, but in this embodiment the terminal device 402 mainly receives the voice wake-up word input by the user and reports it to the server 401 for the server 401 to identify the dialect to which the voice wake-up word belongs.
• The server 401 provides the ASR models for different dialects and selects the corresponding ASR model to perform speech recognition on voice signals in the corresponding dialect; it also has the function of identifying the dialect to which the voice wake-up word belongs.
• When a user needs speech recognition, the voice wake-up word can be input to the terminal device 402; the voice wake-up word is a voice signal with specified text content, such as "open", "Tmall Elf", "hello", and so on.
  • the terminal device 402 receives the voice wake-up word input by the user, and transmits the voice wake-up word to the server 401.
• After receiving the voice wake-up word sent by the terminal device 402, the server 401 identifies the dialect to which the voice wake-up word belongs. For convenience of description and distinction, the dialect to which the voice wake-up word belongs is recorded as the first dialect.
• The first dialect refers to the dialect to which the voice wake-up word belongs and may be, for example, a Mandarin dialect, the Jin dialect, or the Xiang dialect.
  • the server 401 selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform voice recognition on the voice signals in the first dialect based on the ASR model corresponding to the first dialect.
  • the server 401 stores in advance an ASR model corresponding to different dialects.
• Each dialect may correspond to its own ASR model, or several similar dialects may correspond to the same ASR model; this is not limited herein.
  • the ASR model corresponding to the first dialect is used to convert the voice signal of the first dialect into text content.
• After transmitting the voice wake-up word to the server 401, the terminal device 402 continues to transmit the to-be-recognized voice signal to the server 401.
  • the server 401 receives the to-be-identified speech signal sent by the terminal device 402, and performs speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect.
• The to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device 402 after inputting the voice wake-up word; based on this, before transmitting the to-be-recognized voice signal to the server 401, the terminal device 402 may further receive the to-be-recognized voice signal input by the user.
• The to-be-recognized voice signal may also be a voice signal pre-recorded and stored locally on the terminal device 402.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
• The selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition; the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
• One manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up word with reference wake-up words recorded in different dialects, and using, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets the first setting requirement.
• Another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word with the acoustic features of different dialects respectively, and using, as the first dialect, the dialect whose acoustic-feature match with the voice wake-up word meets the second setting requirement.
• Still another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word with the reference text wake-up words corresponding to different dialects respectively, and using, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets the third setting requirement.
  • the manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs is similar to the manner in which the terminal device 102 recognizes the first dialect to which the voice wake-up word belongs. For detailed description, refer to the foregoing embodiment, and details are not described herein again.
  • the manner in which the terminal device 402 receives the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; acquiring a voice wake-up word input by the user based on the voice input interface.
  • the terminal device 402 may output voice input prompt information to prompt the user to perform voice input; and thereafter, receive the voice signal to be recognized input by the user.
• Further, the terminal device 402 may receive a notification message returned by the server 401, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected. Based on this, after determining that the server 401 has selected the ASR model corresponding to the first dialect, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input; the to-be-recognized voice signal input by the user may then be sent to the server 401, which can directly recognize it with the selected ASR model.
• In addition, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 401 may collect corpora of different dialects, perform feature extraction on the corpora to obtain acoustic features of the different dialects, and construct the ASR models corresponding to the different dialects according to those acoustic features. For the detailed procedure of constructing the ASR model corresponding to each dialect, refer to the prior art; details are not described herein again.
• After the speech recognition, the server 401 may return the speech recognition result, or information associated with the speech recognition result, to the terminal device 402.
  • the server 401 may return the text content recognized by the voice to the terminal device 402; or the server 401 may return information such as songs, videos, and the like that match the voice recognition result to the terminal device 402.
  • the terminal device 402 receives the speech recognition result returned by the server 401 or the association information of the speech recognition result, and performs subsequent processing based on the speech recognition result or the association information of the speech recognition result.
  • FIG. 5 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in FIG. 4, mainly from the perspective of the terminal device. As shown in FIG. 5, the method includes:
  • a voice wake-up word may be input to the terminal device, and the voice wake-up word is a voice signal specifying a text content, such as "on”, “Tmall Elf", “hello”, and the like.
• The terminal device receives the voice wake-up word input by the user and sends the voice wake-up word to the server, so that the server identifies the dialect to which the voice wake-up word belongs and thereby determines the dialect to which the subsequent to-be-recognized voice signal belongs (i.e., the dialect to which the voice wake-up word belongs), providing a basis for speech recognition using the ASR model corresponding to that dialect.
  • the dialect to which the speech wake-up word belongs is recorded as the first dialect.
  • the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR model corresponding to the different dialects according to the first dialect to which the voice wake-up word belongs. Then, the terminal device continues to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
• The selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition; the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
  • the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
• In an optional embodiment, before transmitting the to-be-recognized voice signal to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the to-be-recognized voice signal input by the user.
• In an optional embodiment, before outputting the voice input prompt information, the method further includes: receiving a notification message returned by the server, the notification message being used to indicate that the ASR model corresponding to the first dialect has been selected.
  • FIG. 6 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 4, mainly from the perspective of the server. As shown in FIG. 6, the method includes:
• The server receives the voice wake-up word sent by the terminal device and identifies the dialect to which the voice wake-up word belongs, thereby determining the dialect to which the subsequent to-be-recognized voice signal belongs (i.e., the dialect to which the voice wake-up word belongs), which provides the basis for performing speech recognition with the ASR model corresponding to that dialect.
  • the dialect to which the speech wake-up word belongs is recorded as the first dialect.
• The server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects and then performs speech recognition on the subsequent voice signals based on that model, thereby automating multi-dialect speech recognition; the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
• One manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up word with reference wake-up words recorded in different dialects, and using, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets the first setting requirement.
  • another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching acoustic features of the voice wake-up words with acoustic features of different dialects respectively, and acquiring acoustics with the voice wake-up words
  • the dialect of the feature matching the second setting requirement is used as the first dialect.
  • the foregoing manner of recognizing the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, and the text wake-up word respectively corresponding to the dialect word corresponding to different dialects The matching is performed to obtain a dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement as the first dialect.
  • the method before selecting the ASR model corresponding to the first dialect in the ASR model corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on corpora of different dialects to obtain Acoustic characteristics of different dialects; according to the acoustic characteristics of different dialects, construct ASR models corresponding to different dialects.
  • the server may return the speech recognition result or the association information of the speech recognition result to the terminal device.
  • the server may return the text content recognized by the voice to the terminal device; or, the song, the video, and the like that match the voice recognition result may be returned to the terminal device.
  • the speech recognition of the multi-dial is performed by the terminal device and the server, but is not limited thereto.
  • the processing function and the storage function of the terminal device or the server are sufficiently powerful, the multi-word speech recognition function can be separately integrated on the terminal device or the server.
  • still another exemplary embodiment of the present application provides a voice recognition method independently implemented by a server or a terminal device.
  • the server and the terminal device are collectively referred to as an electronic device.
  • the voice recognition method independently implemented by the server or the terminal device includes the following steps:
  • a voice wake-up word may be input to the electronic device, and the voice wake-up word is a voice signal specifying the text content, such as "on”, “Tmall Elf", “hello”, and the like.
  • the electronic device receives the voice wake-up word sent by the user, and identifies the first dialect to which the voice wake-up word belongs.
  • the first dialect refers to the dialect to which the awakening words of speech belong, such as Mandarin dialect, Jin dialect, Xiang dialect and so on.
  • the electronic device selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform voice recognition on the subsequent to-be-identified voice signals based on the ASR model corresponding to the first dialect.
  • the electronic device stores in advance an ASR model corresponding to different dialects.
  • an ASR model corresponding to a dialect, or several similar dialects may also correspond to the same ASR model, which is not limited thereto.
  • the ASR model corresponding to the first dialect is used to convert the voice signal of the first dialect into text content.
  • the electronic device uses the ASR model corresponding to the first dialect to perform speech recognition on the speech signal to be recognized.
  • the to-be-identified voice signal may be a voice signal that the user continues to input to the electronic device after inputting the voice wake-up word, based on which the electronic device performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect. It is also possible to receive a speech signal to be recognized input by the user.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally in the electronic device, based on which the electronic device may directly obtain the voice signal to be recognized from the local.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent speech signals to be recognized, and the multi-dial speech recognition is automated, and the ASR model of the corresponding dialect is automatically selected based on the speech wake-up words, which is more convenient and quick to implement without manual operation by the user. Conducive to improving the efficiency of multi-language speech recognition.
  • the process of recognizing the dialect to which the speech wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the speech wake-up word belongs, and select the ASR model corresponding to the first dialect. To further improve the efficiency of multi-language speech recognition.
  • one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up words with the reference wake-up words recorded in different dialects, and acquiring the sound wake-up words.
  • the dialect corresponding to the reference wake-up word of the first setting requirement is used as the first dialect.
  • another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching acoustic features of the voice wake-up words with acoustic features of different dialects respectively, and acquiring acoustics with the voice wake-up words
  • the dialect of the feature matching the second setting requirement is used as the first dialect.
  • the foregoing manner of recognizing the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, and the text wake-up word respectively corresponding to the dialect word corresponding to different dialects The matching is performed to obtain a dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement as the first dialect.
  • the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
  • the method before performing speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect, the method further includes: outputting the voice input prompt information to prompt the user to perform voice input; and receiving the user input to be recognized. voice signal.
  • the method before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on corpora of different dialects to obtain different The acoustic characteristics of dialects; according to the acoustic characteristics of different dialects, construct ASR models corresponding to different dialects.
  • the electronic device may perform subsequent processing based on the speech recognition result or the association information of the speech recognition result.
  • the voice wake-up word may be preset; or, the user may be allowed to customize the wake-up word.
  • the custom wake-up word or the preset wake-up word mainly refers to the content and/or tone of the wake-up word.
  • the function of the custom voice wake-up word can be implemented by the terminal device or by the server.
  • the function of the custom speech wake-up word may be provided by a device that recognizes the dialect to which the speech wake-up word belongs.
  • the terminal device can provide the user with an entry for a custom wake-up word.
  • the portal can be implemented as a physical button, based on which the user can click on the physical button to trigger a wake-up word customization operation.
  • the entry may be a wake-up word customization sub-item in a setting option of the terminal device, based on which the user may enter a setting option of the terminal device, and then click, hover or long-press for the wake-up word customization sub-item, etc. Operation, which triggers a wake-up word custom action.
  • the terminal device can receive the customized voice signal input by the user in response to the wake-up word custom operation, and save the received custom voice signal as a voice wake-up. word.
  • the terminal device can display an audio entry page to the user to record a customized voice signal sent by the user. For example, after the user triggers the wake-up word customization operation, the terminal device displays the audio input page to the user. At this time, the user can input the voice signal “hello”, and the terminal device will receive the voice signal after receiving the voice signal “hello”. "Hello” is set to the voice wake up word.
  • the terminal device may maintain a wake-up vocabulary and save the user-defined voice wake-up words to the wake-up vocabulary.
  • the voice wake-up word should not be too long to reduce the difficulty in identifying the dialect, but it should not be too short.
  • the speech wake-up word is too short, the recognition is not high, and it is easy to cause false wake-up.
  • the voice wake-up word can be between 3 and 5 characters, but is not limited thereto.
  • the one character here refers to one Chinese character or one English letter.
  • the voice wake-up word is mainly used to wake up or activate the voice recognition function of the application, and may not define the dialect to which the voice wake-up word belongs, that is, the user may use any dialect or Mandarin to issue the voice wake-up word.
  • the user may re-issue a voice signal having a dialect indicating meaning, for example, the voice signal may be a voice signal whose contents are "Tianjin dialect", “Henan dialect”, “enable Minnan dialect", and the like.
  • the dialect that needs speech recognition can be parsed from the voice signal with the dialect indicating meaning sent by the user, and then the ASR model corresponding to the parsed dialect is selected from the ASR models corresponding to different dialects, and based on the selected The ASR model performs speech recognition on subsequent speech signals to be recognized.
  • a speech signal having a dialect indicating meaning herein is referred to as a first speech signal
  • a dialect parsed from the first speech signal is referred to as a first dialect.
  • the voice signal having the dialect guiding meaning can be used as the first voice signal in the embodiment of the present application.
  • the first speech signal may be a speech signal emitted by the user in the first dialect such that the first dialect may be identified based on the acoustic characteristics of the first speech signal.
  • the first voice signal may be a voice signal containing the name of the first dialect, for example, in the voice signal "Please enable the Minnan dialect model", the "Minnan dialect" is the name of the first dialect. Based on this, the phoneme segment corresponding to the name of the first dialect can be extracted from the first voice signal, thereby identifying the first dialect.
  • the above-mentioned voice recognition method combining the voice wake-up word and the first voice signal may be implemented by the terminal device and the server, or may be implemented independently by the terminal device or the server. The following will explain the different implementations separately:
  • Mode A The above-mentioned voice recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server.
  • the terminal device supports a voice wake-up function.
  • the voice wake-up word can be input to the terminal device to wake up the voice recognition function.
  • the terminal device receives the voice wake-up word to wake up the voice recognition function.
  • the user inputs a first voice signal having a dialect guiding meaning to the terminal device; after receiving the first voice signal input by the user, the terminal device parses the first dialect that needs voice recognition from the first voice signal, that is, the subsequent to be recognized The dialect to which the speech signal belongs, thereby providing a basis for speech recognition using the corresponding ASR model of the dialect.
  • the terminal device After parsing the first dialect from the first voice signal, the terminal device sends a service request to the server, where the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the server After receiving the service request sent by the terminal device, the server selects the ASR model corresponding to the first dialect from the ASR model corresponding to the different dialects according to the indication of the service request, so as to perform the subsequent to-be-identified voice signal based on the ASR model corresponding to the first dialect. Speech Recognition.
  • the terminal device After transmitting the service request to the server, the terminal device continues to send the to-be-identified voice signal to the server, where the to-be-identified voice signal belongs to the first dialect.
  • the server receives the to-be-identified voice signal sent by the terminal device, and performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the selected first dialect.
  • the matching ASR model for speech recognition is beneficial to improve the accuracy of speech recognition.
  • the to-be-identified voice signal may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal, and based on this, the terminal device may further receive the user input before sending the to-be-identified voice signal to the server. Identify the voice signal.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally in the terminal device.
  • the speech wake-up word is primarily used to wake up the speech recognition function of the terminal device; and the first dialect that subsequently requires speech recognition may be provided by the first speech signal. Based on this, it is possible to not limit the language used by the user to issue a voice wake-up word.
  • the user can issue a speech wake-up word using Mandarin, or can also use a first dialect to issue a speech wake-up word, or can also use a different dialect than the first dialect to issue a speech wake-up word.
  • the terminal device may preferentially parse the first dialect from the first voice signal; if the first dialect cannot be parsed from the first voice signal, the voice may be recognized.
  • the dialect to which the awakening word belongs is used as the first dialect.
  • the implementation manner of specifically identifying the dialect to which the voice wake-up word belongs is the same as the embodiment of the dialect in which the voice wake-up word is recognized in the above embodiment, and details are not described herein again.
  • Mode B The above-mentioned voice recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server.
  • the terminal device is mainly configured to receive the voice wake-up word and the first voice signal input by the user and report the signal to the server, so that the server parses the first dialect from the first voice signal, which is different from the mode A.
  • Terminal Equipment the server provides the ASR model for different dialects and selects the corresponding ASR model for speech recognition of the speech signal in the corresponding dialect. It also has the function of parsing the first dialect from the first speech signal.
  • a voice wake-up word can be input to the terminal device.
  • the terminal device receives the voice wake-up word input by the user, and sends the voice wake-up word to the server.
  • the server wakes up its own speech recognition function based on the voice wake up words.
  • the user may continue to send the first voice signal to the terminal device.
  • the terminal device transmits the received first voice signal to the server.
  • the server parses the first dialect from the first voice signal, and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to facilitate the subsequent voice of the first dialect based on the ASR model corresponding to the first dialect.
  • the signal is speech recognized.
  • the terminal device After transmitting the first voice signal to the server, the terminal device continues to send the to-be-identified voice signal to the server.
  • the server uses the ASR model corresponding to the first dialect to perform speech recognition on the recognized speech.
  • the to-be-identified voice may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal, and the terminal device may further receive the user input to be recognized before sending the to-be-identified voice signal to the server. voice signal.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally in the terminal device.
  • the method before the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect is not parsed from the first voice signal, identifying the voice wake-up words The dialect to which it belongs is the first dialect.
  • the server when parsing the first dialect that requires speech recognition from the first speech signal, includes: converting the first speech signal to the first phoneme sequence based on the acoustic model; storing the memory in the memory The phoneme segments corresponding to the different dialect names are respectively matched in the first phoneme sequence; when the middle phoneme segment is matched in the first phoneme sequence, the dialect corresponding to the phoneme segment in the matching is used as the first dialect.
  • Mode C The above voice recognition method combining the voice wake-up word and the first voice signal is separately implemented by the terminal device or the server.
  • a voice wake-up word can be input to the terminal device or the server.
  • the terminal device or the server wakes up the voice recognition function according to the voice wake-up word input by the user.
  • the user may continue to input the first voice signal having the dialect guiding meaning to the terminal device or the server.
  • the terminal device or the server parses the first dialect from the first voice signal, and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the terminal device or the server uses the ASR model corresponding to the first dialect to perform speech recognition on the recognized speech.
  • the to-be-identified voice may be a voice signal that the user continues to input to the terminal device or the server after inputting the first voice signal, and the terminal device or the server performs the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the voice signal to be recognized input by the user may also be received.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally at the terminal device or the server, based on which the terminal device or the server may directly obtain the voice signal to be recognized from the local.
  • the method before the terminal device or the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect is not parsed from the first voice signal, identifying The dialect to which the voice wake-up word belongs is used as the first dialect.
  • the terminal device or the server when parsing the first dialect that needs to perform speech recognition from the first speech signal, includes: converting the first speech signal into the first phoneme sequence based on the acoustic model; The phoneme segments corresponding to the different dialect names stored in the first phoneme sequence are matched in the first phoneme sequence; when the middle phoneme segment is matched in the first phoneme sequence, the dialect corresponding to the phoneme segment in the matching is used as the first dialect.
  • parsing the first dialect that needs to perform speech recognition from the first voice signal including: converting the first voice signal into the first phoneme sequence based on the acoustic model; The phoneme segments corresponding to the different dialect names are respectively matched in the first phoneme sequence; when the middle phoneme segment is matched in the first phoneme sequence, the dialect corresponding to the phoneme segment in the matching is used as the first dialect.
  • preprocessing and feature extraction of the first speech signal are required.
  • the preprocessing process includes pre-emphasis, windowing framing, and endpoint detection.
  • the feature extraction is to extract the acoustic features such as time domain features or frequency domain features of the preprocessed first speech signal.
  • the acoustic model can convert the acoustic characteristics of the first speech signal into a phoneme sequence.
  • Phonemes are the basic elements that make up the pronunciation of a word or the pronunciation of a Chinese character. Among them, the phonemes constituting the pronunciation of a word may be 39 phonemes invented by Carnegie Mellon University; the phonemes constituting the pronunciation of Chinese characters may be all initials and finals.
  • Acoustic models include, but are not limited to, neural network based deep learning models, hidden Markov models, and the like. The manner of converting the acoustic features into the phoneme sequences belongs to the prior art, and details are not described herein again.
  • the terminal device or the server After converting the first voice signal into the first phoneme sequence, the terminal device or the server respectively matches the phoneme segments corresponding to the different dialect names in the first phoneme sequence.
  • phoneme fragments of different dialect names may be pre-stored, for example, a phoneme fragment of the dialect name "Henan dialect", a phoneme fragment of the dialect name "Minnan”, a dialect name "British English", and the like. If the dialect name is a word, the phoneme fragment is a segment composed of several phonemes obtained from the 39 phonemes invented by Carnegie Mellon University. If the dialect name is a Chinese character, the phoneme fragment is a fragment composed of the initials and finals of the dialect name.
  • the phoneme segments corresponding to the different dialect names stored in advance are compared to determine whether the first phoneme sequence contains a phoneme segment identical or similar to the phoneme segment of a certain dialect name.
  • a similarity between each phoneme segment in the first phoneme sequence and a phoneme segment in a different dialect name may be calculated; and from a phoneme segment of a different dialect name, selecting a similarity with a phoneme segment in the first phoneme sequence satisfies
  • the phoneme fragment required by the preset similarity is used as the audio segment in the matching.
  • the dialect corresponding to the phoneme segment in the match is used as the first dialect.
  • FIG. 8 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 800 includes a receiving module 801, an identifying module 802, a first transmitting module 803, and a second transmitting module 804.
  • the receiving module 801 is configured to receive a voice wake-up word.
  • the identification module 802 is configured to identify a first dialect to which the voice wake-up word received by the receiving module 801 belongs.
  • the first sending module 803 is configured to send a service request to the server, to request the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the second sending module 804 is configured to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the method is specifically configured to: dynamically match the voice wake-up words with the reference wake-up words recorded in different dialects, and acquire and The dialect of the speech wake-up word matches the dialect corresponding to the reference wake-up word of the first setting requirement as the first dialect; or the acoustic features of the speech-awakening word are respectively matched with the acoustic features of different dialects, and the acoustic characteristics of the awakened word are obtained.
  • the dialect that matches the second setting requirement is used as the first dialect; or the voice wake-up word is converted into the text wake-up word, and the text wake-up word is matched with the reference text wake-up word corresponding to different dialects respectively, and the text wake-up word is obtained.
  • the dialect corresponding to the reference text wake-up word whose matching degree meets the third setting requirement is used as the first dialect.
  • the receiving module 801 when receiving the voice wake-up word, is specifically configured to: display a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire a voice wake-up word input by the user based on the voice input interface. .
  • the second sending module 804 before sending the to-be-identified voice signal to the server, is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
  • the second sending module 804 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the receiving module 801 before receiving the voice wake-up word, is further configured to: receive a custom voice signal input by the user in response to the wake-up word customization operation; and save the customized voice signal as a voice wake-up word.
  • the internal function and structure of the speech recognition apparatus 800 are described above. As shown in FIG. 9, in actuality, the speech recognition apparatus 800 can be implemented as a terminal apparatus, including: a memory 901, a processor 902, and a communication component 903.
  • the memory 901 is configured to store a computer program and can be stored to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 901 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable.
  • SRAM static random access memory
  • EEPROM electrically erasable programmable read only memory
  • EPROM Programmable Read Only Memory
  • PROM Programmable Read Only Memory
  • ROM Read Only Memory
  • Magnetic Memory Flash Memory
  • Disk Disk or Optical Disk.
  • the processor 902 is coupled to the memory 901 for executing a computer program in the memory 901 for: receiving a voice wake-up word through the communication component 903; identifying a first dialect to which the voice wake-up word belongs; and transmitting a service to the server through the communication component 903
  • the request is to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the to-be-identified voice signal to the server through the communication component 903, so that the server uses the ASR model corresponding to the first dialect to perform the recognized speech signal.
  • Speech Recognition is to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the communication component 903 is configured to receive the voice wake-up word, and send the service request and the to-be-identified voice signal to the server.
  • the processor 902 when the processor 902 identifies the first dialect to which the voice wake-up word belongs, the processor 902 is specifically configured to:
  • the voice wake-up words are dynamically matched with the reference wake-up words recorded in different dialects, and the dialect corresponding to the reference wake-up words whose matching degree with the voice wake-up words meets the first setting requirement is obtained as the first dialect; or the voice is spoken
  • the acoustic features of the wake-up words are respectively matched with the acoustic features of different dialects, and the dialects that match the acoustic characteristics of the speech wake-up words according to the second setting requirement are obtained as the first dialect; or the speech wake-up words are converted into the text wake-up words.
  • the text wake-up words are respectively matched with the reference text wake-up words corresponding to different dialects, and the dialect corresponding to the reference text wake-up words whose matching degree with the text wake-up words meets the third setting requirement is obtained as the first dialect.
  • the terminal device further includes: a display screen 904.
  • the processor 902 when receiving the voice wake-up word, is specifically configured to: according to an instruction to activate or turn on the terminal device, display a voice input interface to the user through the display screen 904; and acquire a voice wake-up word input by the user based on the voice input interface. .
  • the terminal device further includes: an audio component 906.
  • the processor 902 is further configured to: output the voice input prompt information through the audio component 906 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user through the audio component 906.
  • the audio component 906 is further configured to output voice input prompt information and receive a voice signal to be recognized input by the user.
  • the processor 902 before outputting the voice input prompt information, is further configured to: receive, by the communication component 903, a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the processor 902, before receiving the voice wake-up word, is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user through the communication component 903; and save the customized voice signal as the voice wake-up word.
  • the terminal device further includes: a power component 905 and other components.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the terminal device in the foregoing method embodiment can be implemented.
  • FIG. 10 is a schematic structural diagram of a module of another voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1000 includes a first receiving module 1001, a selecting module 1002, a second receiving module 1003, and an identifying module 1004.
  • the first receiving module 1001 is configured to receive a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is selected.
  • the selecting module 1002 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, where the first dialect is a dialect to which the voice wake-up word belongs.
  • the second receiving module 1003 is configured to receive a to-be-identified voice signal sent by the terminal device.
  • the identification module 1004 is configured to perform voice recognition on the to-be-identified voice signal received by the second receiving module 1003 by using the ASR model corresponding to the first dialect.
  • the voice recognition apparatus 1000 further includes a building module, configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
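The building module's pipeline (collect corpora, extract acoustic features, construct a per-dialect ASR model) can be sketched minimally as follows; the toy feature extractor is a stand-in for a real acoustic front end (e.g. MFCC extraction), and the "model" is simply a summary of the extracted features:

```python
def extract_features(utterance):
    # Stand-in for real acoustic feature extraction: a single scalar
    # statistic per utterance, used only to illustrate the pipeline.
    codes = [ord(c) for c in utterance]
    return sum(codes) / (len(codes) or 1)

def build_asr_models(corpora):
    """corpora: dict mapping dialect name -> list of utterance strings.
    Returns a dict mapping dialect name -> per-dialect 'model' built
    from the acoustic features of that dialect's corpus."""
    models = {}
    for dialect, utterances in corpora.items():
        features = [extract_features(u) for u in utterances]
        models[dialect] = {
            "dialect": dialect,
            "mean_feature": sum(features) / len(features),
        }
    return models
```

A production system would replace both functions with genuine acoustic modeling, but the control flow — corpus in, features out, one model per dialect — matches the steps listed above.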
  • the speech recognition apparatus 1000 can be implemented as a server, including: a memory 1101, a processor 1102, and a communication component 1103.
  • the memory 1101 is configured to store a computer program and may also store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1101 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1102 is coupled to the memory 1101 and is configured to execute a computer program in the memory 1101 in order to: receive, through the communication component 1103, a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is to be selected; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; receive, through the communication component 1103, the to-be-recognized voice signal sent by the terminal device; and perform speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1103 is configured to receive the service request and the to-be-identified voice signal.
  • the processor 1102 is configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the server further includes an audio component 1106.
  • the processor 1102 is further configured to: receive, by the audio component 1106, the to-be-identified voice signal sent by the terminal device.
  • the server further includes a display 1104, a power component 1105, and the like.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
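The server-side flow just described — receive a service request that selects the first dialect's ASR model, then recognize subsequent voice signals with that model — can be sketched as follows; the request shape and the representation of a model as a plain callable are illustrative assumptions:

```python
class DialectASRServer:
    """Minimal sketch of the server-side flow: a service request selects
    the ASR model of the first dialect; subsequent to-be-recognized voice
    signals are recognized with the selected model."""

    def __init__(self, asr_models):
        # asr_models: dialect name -> recognizer (here, a plain callable)
        self.asr_models = asr_models
        self.selected = None

    def handle_service_request(self, request):
        # The service request indicates that the ASR model corresponding
        # to the first dialect is to be selected.
        dialect = request["dialect"]
        self.selected = self.asr_models[dialect]
        # Notification message telling the terminal the model is ready.
        return {"status": "model_selected", "dialect": dialect}

    def handle_voice_signal(self, signal):
        if self.selected is None:
            raise RuntimeError("no ASR model selected yet")
        return self.selected(signal)
```

This mirrors the first receiving module, selecting module, second receiving module, and identification module of apparatus 1000, but makes no claim about the actual wire protocol.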
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model performs speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing multi-dialect speech.
  • FIG. 12 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1200 includes a receiving module 1201, a first transmitting module 1202, and a second transmitting module 1203.
  • the receiving module 1201 is configured to receive a voice wake-up word.
  • the first sending module 1202 is configured to send the voice wake-up word received by the receiving module 1201 to the server, so that, based on the voice wake-up word, the server selects the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects.
  • the second sending module 1203 is configured to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the receiving module 1201 when receiving the voice wake-up word, is specifically configured to: display a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire a voice wake-up word input by the user based on the voice input interface.
  • the second sending module 1203 before sending the to-be-identified voice signal to the server, is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
  • the second sending module 1203 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the receiving module 1201 before receiving the voice wake-up word, is further configured to: receive a customized voice signal input by the user in response to the wake-up word customization operation.
  • the first sending module 1202 is further configured to upload a customized voice signal to the server.
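A minimal sketch of the terminal-side interaction described by modules 1201–1203, with a hypothetical server interface (`select_model_for`, `recognize`) standing in for the real communication protocol:

```python
class TerminalDevice:
    """Sketch of the terminal-side modules above. `server` is a
    hypothetical stand-in providing select_model_for(wake_word) -> dict
    and recognize(signal) -> text."""

    def __init__(self, server):
        self.server = server

    def on_wake_word(self, wake_word_audio):
        # First sending module: forward the voice wake-up word so the
        # server can select the first dialect's ASR model, and wait for
        # the notification message confirming the selection.
        notification = self.server.select_model_for(wake_word_audio)
        return notification.get("status") == "model_selected"

    def recognize(self, wake_word_audio, voice_signal):
        # Second sending module: send the to-be-recognized voice signal
        # only after the server confirms the model is selected.
        if self.on_wake_word(wake_word_audio):
            return self.server.recognize(voice_signal)
        return None
```

In the embodiment, the notification message is what gates prompting the user for voice input; the sketch encodes that ordering but nothing about transport or audio capture.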
  • the speech recognition apparatus 1200 can be implemented as a terminal device, including: a memory 1301, a processor 1302, and a communication component 1303.
  • the memory 1301 is configured to store a computer program and may also store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1301 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1302 is coupled to the memory 1301 and is configured to execute a computer program in the memory 1301 in order to: receive a voice wake-up word through the communication component 1303; send the voice wake-up word to the server through the communication component 1303, so that, based on the voice wake-up word, the server selects the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server through the communication component 1303, so that the server performs speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1303 is configured to receive the voice wake-up word, and send the voice wake-up word and the to-be-identified voice signal to the server.
  • the terminal device further includes a display screen 1304.
  • the processor 1302, when receiving the voice wake-up word, is specifically configured to: in response to an instruction to activate or turn on the terminal device, display a voice input interface to the user through the display screen 1304; and acquire the voice wake-up word input by the user based on the voice input interface.
  • the terminal device further includes an audio component 1306.
  • the processor 1302 is configured to receive the speech wake-up words through the audio component 1306.
  • the processor 1302 is further configured to: output the voice input prompt information through the audio component 1306 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
  • the processor 1302 before outputting the voice input prompt information, is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the processor 1302 before receiving the voice wake-up word, is further configured to: in response to the wake-up word custom operation, receive the customized voice signal input by the user through the communication component 1303, and upload the customized voice signal. To the server.
  • the terminal device further includes: a power component 1305 and other components.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the terminal device in the foregoing method embodiment can be implemented.
  • FIG. 14 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1400 includes a first receiving module 1401, a first identifying module 1402, a selecting module 1403, a second receiving module 1404, and a second identifying module 1405.
  • the first receiving module 1401 is configured to receive a voice wake-up word sent by the terminal device.
  • the first identification module 1402 is configured to identify a first dialect to which the voice wake-up word belongs.
  • the selecting module 1403 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the second receiving module 1404 is configured to receive a to-be-identified voice signal sent by the terminal device.
  • the second identification module 1405 is configured to perform voice recognition on the to-be-identified voice signal received by the second receiving module 1404 by using the ASR model corresponding to the first dialect.
  • the first identification module 1402 is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the speech recognition apparatus 1400 further includes a building module, configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the speech recognition apparatus 1400 can be implemented as a server including: a memory 1501, a processor 1502, and a communication component 1503.
  • the memory 1501 is configured to store a computer program and may also store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1501 can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1502 is coupled to the memory 1501 and is configured to execute a computer program in the memory 1501 in order to: receive, through the communication component 1503, a voice wake-up word sent by the terminal device; identify the first dialect to which the voice wake-up word belongs; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; receive, through the communication component 1503, the to-be-recognized voice signal sent by the terminal device; and perform speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1503 is configured to receive a voice wake-up word and a voice signal to be recognized.
  • the processor 1502, when identifying the first dialect to which the voice wake-up word belongs, is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the processor 1502 is configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the server further includes an audio component 1506.
  • the processor 1502 is configured to: receive, by the audio component 1506, a voice wake-up word sent by the terminal device, and receive, by the audio component 1506, the voice signal to be recognized sent by the terminal device.
  • the server further includes: a display 1504, a power component 1505, and the like.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
  • FIG. 16 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1600 includes a receiving module 1601, a first identifying module 1602, a selecting module 1603, and a second identifying module 1604.
  • the receiving module 1601 is configured to receive a voice wake-up word.
  • the first identification module 1602 is configured to identify a first dialect to which the voice wake-up word belongs.
  • the selecting module 1603 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the second identification module 1604 is configured to perform speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect.
  • the first identification module 1602 is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the receiving module 1601 when receiving the voice wake-up word sent by the terminal device, is specifically configured to: display a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire user input based on the voice input interface. Voice wake up words.
  • the second identification module 1604 is further configured to: before performing speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect, output voice input prompt information to prompt the user to perform voice input, and receive the to-be-recognized voice signal input by the user.
  • the voice recognition apparatus 1600 further includes a building module, configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the receiving module 1601, before receiving the voice wake-up word, is further configured to: receive a customized voice signal input by the user in response to a wake-up word customization operation; and save the customized voice signal as the voice wake-up word.
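The wake-up word customization flow (save a user-recorded signal, then use it later to detect the wake-up word) might be sketched as follows; the exact-equality comparison is a placeholder for real acoustic matching:

```python
class WakeWordStore:
    """Sketch of wake-up word customization: the user's recorded signal
    is saved as the voice wake-up word and later used for detection."""

    def __init__(self):
        self.custom_wake_word = None

    def on_customize(self, recorded_signal):
        # In response to the wake-up word customization operation,
        # save the customized voice signal as the voice wake-up word.
        self.custom_wake_word = recorded_signal

    def is_wake_word(self, signal):
        # Placeholder comparison; a real device would score acoustic
        # similarity rather than require byte-for-byte equality.
        return (self.custom_wake_word is not None
                and signal == self.custom_wake_word)
```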
  • the speech recognition apparatus 1600 can be implemented as an electronic device including: a memory 1701, a processor 1702, and a communication component 1703.
  • the electronic device can be a terminal device or a server.
  • the memory 1701 is configured to store a computer program and may also store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1701 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1702 is coupled to the memory 1701 and is configured to execute a computer program in the memory 1701 in order to: receive a voice wake-up word through the communication component 1703; identify the first dialect to which the voice wake-up word belongs; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and perform speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1703 is configured to receive a voice wake-up word.
  • the processor 1702, when identifying the first dialect to which the voice wake-up word belongs, is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the electronic device further includes: a display screen 1704.
  • the processor 1702, when receiving the voice wake-up word sent by the terminal device, is specifically configured to: in response to an instruction to activate or turn on the terminal device, display a voice input interface to the user through the display screen 1704; and acquire the voice wake-up word input by the user based on the voice input interface.
  • the electronic device further includes an audio component 1706.
  • the processor 1702 is further configured to: output voice input prompt information through the audio component 1706 to prompt the user to perform voice input, and receive the to-be-recognized voice signal input by the user. The processor 1702 is also configured to receive the voice wake-up word through the audio component 1706.
  • the processor 1702 is configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the processor 1702, before receiving the voice wake-up word, is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user through the communication component 1703; and save the customized voice signal as the voice wake-up word.
  • the electronic device further includes: a power component 1705 and other components.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the electronic device in the foregoing method embodiment can be implemented.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing multi-dialect speech.
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component.
  • a memory, configured to store a computer program and various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), and erasable programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Disk or Optical Disk.
  • a processor, coupled to the memory and the communication component, configured to execute a computer program in the memory in order to: receive a voice wake-up word through the communication component to wake up the voice recognition function; receive, through the communication component, a dialect-indicating first voice signal input by the user; parse, from the first voice signal, the first dialect in which speech recognition is required; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; send a service request to the server through the communication component to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server through the communication component, so that the server performs speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component is configured to receive a voice wake-up word and the first voice signal, and send a service request and a voice signal to be recognized to the server.
  • the processor, before sending the service request to the server, is further configured to: if the first dialect is not parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
  • the memory is further configured to store phoneme segments corresponding to different dialect names.
  • the processor, when parsing the first dialect that requires speech recognition from the first speech signal, is specifically configured to: convert the first speech signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
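The phoneme-matching step described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the acoustic model is omitted (the sketch starts from an already-converted phoneme sequence), and the dialect-name phoneme segments are toy values rather than a real phoneme inventory.

```python
# Sketch of dialect detection by phoneme-segment matching.
# The dialect-name phoneme segments below are toy values; a real system
# would use the phoneme inventory of its acoustic model.

DIALECT_PHONEME_SEGMENTS = {
    "Cantonese": ["g", "w", "o", "ng", "d", "u", "ng", "w", "a"],
    "Sichuanese": ["s", "i", "ch", "u", "an", "h", "u", "a"],
}

def match_dialect(first_phoneme_sequence):
    """Return the dialect whose name's phoneme segment occurs as a
    contiguous sub-sequence of the phoneme sequence, or None."""
    for dialect, segment in DIALECT_PHONEME_SEGMENTS.items():
        m = len(segment)
        for start in range(len(first_phoneme_sequence) - m + 1):
            if first_phoneme_sequence[start:start + m] == segment:
                return dialect
    return None

# A phoneme sequence that contains the segment for "Cantonese".
seq = ["n", "i", "g", "w", "o", "ng", "d", "u", "ng", "w", "a", "l", "a"]
print(match_dialect(seq))  # → Cantonese
```

If no segment is matched, the function returns None, which corresponds to the fallback case above where the dialect of the voice wake-up word is used as the first dialect.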
  • the embodiment of the present application further provides a server, including: a memory, a processor, and a communication component.
  • the memory can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • a processor coupled to the memory and the communication component, for executing a computer program in the memory, for: receiving, through the communication component, a voice wake-up word sent by the terminal device to wake up the voice recognition function; receiving, through the communication component, a first voice signal having dialect-indicating meaning sent by the terminal device; parsing, from the first voice signal, a first dialect that requires voice recognition; selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and receiving, through the communication component, the voice signal to be recognized sent by the terminal device, and performing voice recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
  • a communication component configured to receive a voice wake-up word, a first voice signal, and a voice signal to be recognized.
  • the processor, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, is further configured to: if the first dialect is not parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
  • the memory is further configured to store phoneme segments corresponding to different dialect names.
  • the processor, when parsing the first dialect that requires speech recognition from the first speech signal, is specifically configured to: convert the first speech signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
  • the embodiment of the present application further provides an electronic device, which may be a terminal device or a server.
  • the electronic device includes a memory, a processor, and a communication component.
  • a memory for storing a computer program and for storing various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • a processor coupled to the memory and the communication component, for executing a computer program in the memory, for: receiving a voice wake-up word through the communication component to wake up the voice recognition function; receiving, through the communication component, a user-entered first voice signal having dialect-indicating meaning; parsing, from the first voice signal, a first dialect that requires speech recognition; selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
  • a communication component configured to receive the voice wake-up word and the first voice signal.
  • the processor, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, is further configured to: if the first dialect is not parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
  • the memory is further configured to store phoneme segments corresponding to different dialect names.
  • the processor, when parsing the first dialect that requires speech recognition from the first speech signal, is specifically configured to: convert the first speech signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
  • the communication components in Figures 9, 11, 13, 15, and 17 above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices.
  • the device in which the communication component is located can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel.
  • the communication component also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the display screens of FIGS. 9, 11, 13, 15, and 17 described above include a liquid crystal display (LCD) and a touch panel (TP). If the display includes a touch panel, the display can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the power components in Figures 9, 11, 13, 15, and 17 above provide power to the various components of the device in which the power components are located.
  • the power components can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
  • the audio component includes a microphone (MIC) that is configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode.
  • the received audio signal can be further stored in a memory or transmitted via a communication component.
  • the audio component further includes a speaker for outputting an audio signal.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • these computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes permanent and non-permanent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media does not include transitory computer readable media, such as modulated data signals and carrier waves.

Abstract

A speech recognition method, comprising: receiving a speech wake-up word (21); recognizing a first dialect to which the speech wake-up word belongs (22); sending to a server a service request to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects (23); and sending to the server a speech signal to be recognized to enable the server to perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect (24). The method can automatically perform speech recognition on multiple dialects, improving the efficiency of speech recognition for multiple dialects. Also provided are a speech recognition device and system.

Description

Speech recognition method, device and system
This application claims priority to Chinese Patent Application No. 201711147698.X, filed on November 17, 2017 and entitled "Speech recognition method, device and system", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, and system.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech audio signals into text. With the development of software and hardware technologies, the computing power and storage capacity of various smart devices have greatly improved, allowing speech recognition technology to be widely applied in smart devices.
In speech recognition, speech phonemes need to be recognized accurately, and only accurately recognized phonemes can be converted into text. However, whatever the language, various factors give rise to many different pronunciations of that language, that is, multiple dialects. Taking Chinese as an example, there are Mandarin, Jin, Xiang, Gan, Wu, Min, Cantonese, Hakka, and other dialects, and the pronunciations of different dialects differ considerably.
At present, speech recognition schemes for dialects are still immature, and a solution to the multi-dialect problem remains to be provided.
Summary of the invention
Aspects of the present application provide a speech recognition method, apparatus, and system for automatically performing speech recognition on multiple dialects, thereby improving the efficiency of speech recognition for multiple dialects.
An embodiment of the present application provides a speech recognition method, applicable to a terminal device, the method including:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
sending a service request to a server to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
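The steps of this terminal-side method can be illustrated with the following sketch. The server and the wake-word dialect classifier are hypothetical in-memory stand-ins, since the document does not prescribe a transport protocol or a concrete classifier; a real terminal would talk to the server over its network stack.

```python
# Sketch of the terminal-side flow: wake-up word -> dialect ->
# service request -> speech signal. The "server" here is an in-memory
# stand-in for illustration only.

def identify_wake_word_dialect(wake_word_audio):
    # Hypothetical classifier keyed on a pseudo-fingerprint of the audio.
    known = {"wake-cantonese": "Cantonese", "wake-mandarin": "Mandarin"}
    return known.get(wake_word_audio, "Mandarin")

def send_service_request(server, dialect):
    # Step 3: request that the server select the ASR model for `dialect`.
    server["selected_model"] = server["asr_models"][dialect]

def send_speech_for_recognition(server, speech_signal):
    # Step 4: the server recognizes speech with the selected model.
    return server["selected_model"](speech_signal)

server = {  # toy server holding one stand-in ASR model per dialect
    "asr_models": {
        "Cantonese": lambda s: f"[Cantonese ASR] {s}",
        "Mandarin": lambda s: f"[Mandarin ASR] {s}",
    },
    "selected_model": None,
}

dialect = identify_wake_word_dialect("wake-cantonese")      # steps 1-2
send_service_request(server, dialect)                       # step 3
print(send_speech_for_recognition(server, "speech-bytes"))  # step 4
```

The point of the ordering is that the model is selected once, at wake-up time, so every subsequent speech signal in the session is recognized with the dialect-appropriate model without further negotiation.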
An embodiment of the present application further provides a speech recognition method, applicable to a server, the method including:
receiving a service request sent by a terminal device, the service request indicating selection of an ASR model corresponding to a first dialect;
selecting, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
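On the server side, the core of this method is a lookup from the requested dialect to a pre-built ASR model. The following is a minimal sketch under that reading; the per-dialect models are stand-in objects, not real recognizers.

```python
# Sketch: the server holds one ASR model per dialect and selects the
# model named by the service request. AsrModel is a stand-in class.

class AsrModel:
    def __init__(self, dialect):
        self.dialect = dialect

    def recognize(self, speech_signal):
        # A real model would decode the audio; this stub just labels it.
        return f"transcript ({self.dialect})"

ASR_MODELS = {d: AsrModel(d) for d in ("Mandarin", "Cantonese", "Xiang")}

def handle_service_request(first_dialect):
    """Select the ASR model for the dialect indicated in the request."""
    model = ASR_MODELS.get(first_dialect)
    if model is None:
        raise ValueError(f"no ASR model for dialect: {first_dialect}")
    return model

model = handle_service_request("Cantonese")
print(model.recognize(b"speech-to-recognize"))  # → transcript (Cantonese)
```

Keeping all models resident on the server and selecting per session is the design choice the claims describe: the terminal never needs to ship or store a dialect model itself.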
An embodiment of the present application further provides a speech recognition method, applicable to a terminal device, the method including:
receiving a voice wake-up word;
sending the voice wake-up word to a server, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition method, applicable to a server, the method including:
receiving a voice wake-up word sent by a terminal device;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition method, including:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
selecting an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
performing speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect.
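This variant places the entire pipeline on a single device. A compact sketch follows, with both the wake-word-to-dialect table and the per-dialect models reduced to toy stand-ins (neither is specified concretely by the document):

```python
# Sketch of the single-device variant: identify the wake-up word's
# dialect, select the matching ASR model, recognize the speech signal.

WAKE_WORD_DIALECTS = {"nei hou": "Cantonese", "ni hao": "Mandarin"}  # toy table

ASR_MODELS = {  # stand-in models, one per supported dialect
    "Cantonese": lambda speech: f"Cantonese transcript of {speech}",
    "Mandarin": lambda speech: f"Mandarin transcript of {speech}",
}

def recognize(wake_word, speech_signal, default_dialect="Mandarin"):
    dialect = WAKE_WORD_DIALECTS.get(wake_word, default_dialect)  # steps 1-2
    model = ASR_MODELS[dialect]                                   # step 3
    return model(speech_signal)                                   # step 4

print(recognize("nei hou", "speech"))  # → Cantonese transcript of speech
```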
An embodiment of the present application further provides a speech recognition method, applicable to a terminal device, the method including:
receiving a voice wake-up word to wake up a speech recognition function;
receiving a first voice signal, input by a user, having dialect-indicating meaning;
parsing, from the first voice signal, a first dialect that requires speech recognition;
sending a service request to a server to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
send a service request to a server through the communication component to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application further provides a server, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive, through the communication component, a service request sent by a terminal device, the service request indicating selection of an ASR model corresponding to a first dialect;
select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and
receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the service request and the speech signal to be recognized.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component;
send the voice wake-up word to a server through the communication component, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word, and to send the voice wake-up word and the speech signal to be recognized to the server.
An embodiment of the present application further provides a server, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive, through the communication component, a voice wake-up word sent by a terminal device;
identify a first dialect to which the voice wake-up word belongs;
select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word and the speech signal to be recognized.
An embodiment of the present application further provides an electronic device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component to wake up a speech recognition function;
receive, through the communication component, a first voice signal, input by a user, having dialect-indicating meaning;
parse, from the first voice signal, a first dialect that requires speech recognition;
send a service request to a server through the communication component to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application further provides a computer readable storage medium storing a computer program which, when executed by a computer, implements the steps in the first speech recognition method embodiment described above.
An embodiment of the present application further provides a computer readable storage medium storing a computer program which, when executed by a computer, implements the steps in the second speech recognition method embodiment described above.
An embodiment of the present application further provides a speech recognition system, including a server and a terminal device;
the terminal device is configured to receive a voice wake-up word, identify a first dialect to which the voice wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, the service request indicating selection of the ASR model corresponding to the first dialect;
the server is configured to receive the service request, select, as indicated by the service request, the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition system, including a server and a terminal device;
the terminal device is configured to receive a voice wake-up word, send the voice wake-up word to the server, and send a speech signal to be recognized to the server;
the server is configured to receive the voice wake-up word, identify a first dialect to which the voice wake-up word belongs, select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
In the embodiments of the present application, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition; and because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is required, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
附图说明DRAWINGS
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings described herein are intended to provide a further understanding of the present application, and are intended to be a part of this application. In the drawing:
FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present application;
FIG. 2 is a schematic flowchart of a speech recognition method according to another exemplary embodiment of the present application;
FIG. 3 is a schematic flowchart of another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 4 is a schematic structural diagram of another speech recognition system according to yet another exemplary embodiment of the present application;
FIG. 5 is a schematic flowchart of yet another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 6 is a schematic flowchart of yet another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 7 is a schematic flowchart of yet another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of the module structure of a speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device according to yet another exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of the module structure of another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server according to yet another exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of the module structure of yet another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of yet another terminal device according to yet another exemplary embodiment of the present application;
FIG. 14 is a schematic diagram of the module structure of yet another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 15 is a schematic structural diagram of another server according to yet another exemplary embodiment of the present application;
FIG. 16 is a schematic diagram of the module structure of yet another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 17 is a schematic structural diagram of an electronic device according to yet another exemplary embodiment of the present application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments and the corresponding drawings. It is apparent that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
In the prior art, speech recognition schemes for dialects are not yet mature. To address this technical problem, the embodiments of the present application provide a solution whose main idea is as follows: ASR models are built for different dialects; during speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition; the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present application. As shown in FIG. 1, the speech recognition system 100 includes a server 101 and a terminal device 102, which are communicatively connected to each other.
For example, the terminal device 102 may communicate with the server 101 via the Internet, or via a mobile network. If the terminal device 102 communicates with the server 101 via a mobile network, the network standard of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, and the like.
The server 101 mainly provides ASR models for different dialects and selects the corresponding ASR model to perform speech recognition on speech signals in the corresponding dialect. The server 101 may be any device that can provide computing services, respond to service requests, and process them, such as a conventional server, a cloud server, a cloud host, or a virtual center. The server mainly comprises a processor, a hard disk, memory, a system bus, and the like, similar to a general-purpose computer architecture.
In this embodiment, the terminal device 102 is mainly user-facing and may provide the user with an interface or entry point for speech recognition. The terminal device 102 may take many forms, such as a smartphone, a smart speaker, a personal computer, a wearable device, or a tablet computer. The terminal device 102 typically includes at least one processing unit and at least one memory; their number depends on the configuration and type of the terminal device 102. The memory may be volatile, such as RAM, non-volatile, such as read-only memory (ROM) or flash memory, or include both types. The memory typically stores an operating system (OS), one or more applications, and possibly program data. In addition to the processing unit and memory, the terminal device 102 also includes some basic components, such as a network interface chip, an IO bus, and audio/video components (e.g., a microphone). Optionally, the terminal device 102 may also include peripheral devices, such as a keyboard, a mouse, a stylus, or a printer. These peripheral devices are well known in the art and are not described herein.
In this embodiment, the terminal device 102 and the server 101 cooperate to provide a speech recognition function to the user. In addition, in some cases the terminal device 102 may be used by multiple users who speak different dialects. Taking Chinese as an example, by region the dialects may include the following groups: Mandarin, Jin, Xiang, Gan, Wu, Min, Cantonese (Yue), and Hakka. Some dialects can be further subdivided; for example, Min includes Northern Min, Southern Min, Eastern Min, Central Min, Puxian, and so on. The pronunciations of different dialects differ considerably, so a single ASR model cannot perform speech recognition for all of them. Therefore, in this embodiment, separate ASR models are built for different dialects to enable speech recognition of each. Based on the cooperation between the terminal device 102 and the server 101, a speech recognition function can then be provided to users of different dialects; that is, the speech signals of users speaking different dialects can be recognized.
To improve speech recognition efficiency, the terminal device 102 supports a voice wake-up word function: when the user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device 102 to wake up the speech recognition function. The voice wake-up word is a speech signal with specified text content, for example "start", "Tmall Genie", or "hello". The terminal device 102 receives the voice wake-up word input by the user and identifies the dialect to which it belongs, thereby determining the dialect of the subsequent to-be-recognized voice signal (i.e., the dialect of the voice wake-up word), which provides the basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is referred to as the first dialect. The first dialect may be any dialect of any language.
After identifying the first dialect to which the voice wake-up word belongs, the terminal device 102 may send a service request to the server 101, instructing the server 101 to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. The server 101 receives the service request sent by the terminal device 102 and then, according to the request, selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so that subsequent to-be-recognized voice signals can be recognized with that model. In this embodiment, the server 101 stores the ASR models corresponding to different dialects in advance. An ASR model is a model that converts speech signals into text. Optionally, each dialect may have its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert speech signals in the first dialect into text content.
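The dialect-to-model selection described above can be sketched as a simple server-side registry. The dialect names, `ASRModel` class, and registry layout below are illustrative assumptions for this sketch, not part of the present application; the sketch only shows that one model may serve several similar dialects, as the embodiment allows.

```python
class ASRModel:
    """Placeholder ASR model that 'transcribes' speech in one dialect."""
    def __init__(self, dialect):
        self.dialect = dialect

    def transcribe(self, audio):
        # A real model would decode audio; here we return a stub transcript.
        return f"<{self.dialect} transcript of {len(audio)} samples>"

# One model per dialect; several similar dialects may share a model.
MODEL_REGISTRY = {
    "mandarin": ASRModel("mandarin"),
    "cantonese": ASRModel("cantonese"),
}
MODEL_REGISTRY["southern_min"] = MODEL_REGISTRY["northern_min"] = ASRModel("min")

def select_model(first_dialect):
    """Select the ASR model for the dialect identified from the wake-up word."""
    model = MODEL_REGISTRY.get(first_dialect)
    if model is None:
        raise KeyError(f"no ASR model registered for dialect {first_dialect!r}")
    return model
```

Once selected, the same model instance is reused for the subsequent to-be-recognized voice signal, so the lookup happens only once per wake-up.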
After sending the service request to the server 101, the terminal device 102 continues by sending the to-be-recognized voice signal, which belongs to the first dialect, to the server 101. The server 101 receives the to-be-recognized voice signal sent by the terminal device 102 and performs speech recognition on it with the selected ASR model corresponding to the first dialect. This not only makes speech recognition of the first dialect possible; using a matching ASR model also helps improve recognition accuracy.
Optionally, the to-be-recognized voice signal may be a speech signal that the user continues to input to the terminal device 102 after inputting the voice wake-up word; in this case, the terminal device 102 also receives the to-be-recognized voice signal input by the user before sending it to the server 101. Alternatively, the to-be-recognized voice signal may be a speech signal recorded in advance and stored locally on the terminal device 102, in which case the terminal device 102 can obtain it directly from local storage.
In some exemplary embodiments, the server 101 may return the speech recognition result, or information associated with it, to the terminal device 102. For example, the server 101 may return the recognized text content to the terminal device 102, or it may return information such as songs or videos that match the recognition result. The terminal device 102 receives the speech recognition result or its associated information returned by the server 101 and performs subsequent processing accordingly. For example, after receiving the recognized text content, the terminal device 102 may display it to the user or perform a web search based on it. As another example, after receiving associated information such as a song or video, the terminal device 102 may play it, or forward it to other users for sharing.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects. The selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
The embodiments of the present application do not limit the way in which the terminal device 102 identifies the first dialect to which the voice wake-up word belongs; any method that can identify that dialect is applicable to the embodiments of the present application. Some exemplary embodiments below list several ways in which the terminal device 102 may identify the dialect of the voice wake-up word:
Method 1: the terminal device 102 dynamically matches the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and takes as the first dialect the dialect of the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
In Method 1, reference wake-up words are recorded in advance in different dialects, with the same text content as the voice wake-up word. Because speakers of different dialects articulate differently, the acoustic features of the reference wake-up words recorded in different dialects differ. On this basis, the terminal device 102 pre-records the reference wake-up words in different dialects and, after receiving the voice wake-up word input by the user, dynamically matches its acoustic features against each reference wake-up word to obtain a matching degree for each. The first setting requirement may vary with the application scenario. For example, the dialect of the reference wake-up word with the highest matching degree may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect of a reference wake-up word whose matching degree exceeds the threshold taken as the first dialect; or a matching-degree range may be set, and the dialect of a reference wake-up word whose matching degree falls within that range taken as the first dialect.
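The three variants of the "first setting requirement" can be sketched as follows. The scores, threshold values, and dialect names are made-up assumptions for illustration; the embodiment does not prescribe any particular values.

```python
def pick_dialect(scores, mode="max", threshold=0.8, score_range=(0.8, 1.0)):
    """scores: dict mapping dialect -> matching degree with the wake-up word.

    mode "max":       dialect with the highest matching degree
    mode "threshold": a dialect whose matching degree exceeds a threshold
    mode "range":     a dialect whose matching degree falls within a range
    """
    if mode == "max":
        return max(scores, key=scores.get)
    if mode == "threshold":
        for dialect, s in scores.items():
            if s > threshold:
                return dialect
        return None  # no dialect met the requirement
    if mode == "range":
        lo, hi = score_range
        for dialect, s in scores.items():
            if lo <= s <= hi:
                return dialect
        return None
    raise ValueError(f"unknown mode: {mode}")
```

Whichever variant is used, the returned dialect is the "first dialect" that drives the subsequent ASR model selection.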
In Method 1, the acoustic features may take the form of time-domain and frequency-domain features of the speech signal. There are many matching methods based on such features; optionally, dynamic matching of the time series of the voice wake-up word can be performed based on the dynamic time warping (DTW) method.
Dynamic time warping is a method of measuring the similarity between two time series. The terminal device 102 generates a time series from the input voice wake-up word and compares it with the time series of each reference wake-up word recorded in a different dialect. Between the two time series being compared, at least one pair of similar points is determined, and the sum of the distances between similar points, i.e., the warping path distance, measures the similarity of the two series. Optionally, the dialect of the reference wake-up word with the smallest warping path distance to the voice wake-up word may be taken as the first dialect; or a distance threshold may be set, and the dialect of a reference wake-up word whose warping path distance to the voice wake-up word is below the threshold taken as the first dialect; or a distance range may be set, and the dialect of a reference wake-up word whose warping path distance falls within that range taken as the first dialect.
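The warping path distance described above can be computed with the textbook DTW recurrence. The sketch below operates on one-dimensional toy sequences for clarity; a real system would run the same recurrence over multidimensional acoustic feature frames with a vector distance.

```python
def dtw_distance(a, b):
    """Warping path distance between two 1-D sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

def closest_dialect(wake_word, references):
    """Pick the dialect whose reference wake-up word is closest under DTW.
    references: dict mapping dialect -> reference wake-up word sequence."""
    return min(references, key=lambda d: dtw_distance(wake_word, references[d]))
```

Because DTW allows one series to stretch against the other, it tolerates the speaking-rate differences that naturally occur between the user's wake-up word and the pre-recorded references.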
Method 2: the terminal device 102 identifies the acoustic features of the voice wake-up word, matches them against the acoustic features of different dialects, and takes as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second setting requirement.
In Method 2, the acoustic features of different dialects are obtained in advance; the acoustic features of the voice wake-up word are identified, and the first dialect to which the voice wake-up word belongs is then determined based on the matching between acoustic features.
Optionally, before the acoustic features of the voice wake-up word are identified, the voice wake-up word may be filtered and digitized. Filtering refers to retaining the components of the voice wake-up word whose frequency lies between 300 and 3400 Hz. Digitization refers to performing A/D conversion and anti-aliasing processing on the retained signal.
Optionally, the acoustic features of the voice wake-up word may be identified by computing its spectral feature parameters, such as shifted delta cepstral (SDC) parameters. As in Method 1, the second setting requirement may vary with the application scenario. For example, the dialect whose acoustic features have the highest matching degree with those of the voice wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and a dialect whose matching degree with the acoustic features of the voice wake-up word exceeds the threshold taken as the first dialect; or a matching-degree range may be set, and a dialect whose matching degree with the acoustic features of the voice wake-up word falls within that range taken as the first dialect.
The shifted delta cepstral parameters consist of several blocks of delta cepstra spanning multiple speech frames; by taking the delta cepstra of preceding and following frames into account, they incorporate more temporal information. The SDC parameters of the voice wake-up word are compared with those of the reference wake-up words recorded in different dialects. Optionally, the dialect of the reference wake-up word whose SDC parameters best match those of the voice wake-up word may be taken as the first dialect; or a parameter-difference threshold may be set, and the dialect of a reference wake-up word whose SDC parameter difference from the voice wake-up word is below the threshold taken as the first dialect; or a parameter-difference range may be set, and the dialect of a reference wake-up word whose SDC parameter difference falls within that range taken as the first dialect.
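The block structure of SDC features can be sketched as follows, assuming the common (N, d, P, k) parameterization of shifted delta cepstra; the toy scalar frames below stand in for N-dimensional cepstral vectors, and the parameter values are illustrative assumptions only.

```python
def sdc(frames, d=1, P=3, k=7):
    """For each frame t, stack k delta values taken every P frames:
    delta_i(t) = frames[t + i*P + d] - frames[t + i*P - d], i = 0..k-1.
    This is how preceding/following frame deltas are folded into one feature."""
    out = []
    # last frame index t for which all k shifted deltas stay in bounds
    last = len(frames) - ((k - 1) * P + d) - 1
    for t in range(d, last + 1):
        out.append([frames[t + i * P + d] - frames[t + i * P - d]
                    for i in range(k)])
    return out
```

Each output row concatenates k delta blocks shifted P frames apart, which is what gives SDC features their longer temporal span compared with plain frame-level deltas.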
Method 3: the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect of the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement is taken as the first dialect.
In Method 3, the text wake-up word is the text obtained by performing speech recognition on the voice wake-up word, and the reference text wake-up words corresponding to different dialects are the texts obtained by performing speech recognition on the reference wake-up words of those dialects. Optionally, the same speech recognition model may be used to perform coarse recognition of both the voice wake-up word and the reference wake-up words of different dialects, to improve the efficiency of the overall speech recognition process. Alternatively, the ASR models of the different dialects may be used in advance to recognize the reference wake-up words of their respective dialects and convert them into the corresponding reference text wake-up words. After the voice wake-up word is received, the ASR model of one dialect at a time is selected; the voice wake-up word is recognized with the selected ASR model to obtain a text wake-up word, which is then matched against the reference text wake-up word of that dialect. If the matching degree between the reference text wake-up word of that dialect and the text wake-up word meets the third setting requirement, that dialect is taken as the first dialect.
Otherwise, if the matching degree between the reference text wake-up word of that dialect and the text wake-up word does not meet the third setting requirement, the voice wake-up word is recognized with the ASR model of the next dialect and converted into a text wake-up word, which is matched against that dialect's reference text wake-up word, and so on, until a reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement is obtained; the dialect of that reference text wake-up word is taken as the first dialect to which the voice wake-up word belongs.
Optionally, as in Methods 1 and 2, the dialect of the reference text wake-up word with the highest matching degree with the text wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect of a reference text wake-up word whose matching degree exceeds the threshold taken as the first dialect; or a matching-degree range may be set, and the dialect of a reference text wake-up word whose matching degree falls within that range taken as the first dialect.
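Method 3's sequential trial loop can be sketched as follows. The `recognize` callable and the character-level `similarity` measure are placeholder assumptions standing in for a per-dialect ASR pass and a text matching-degree metric; the 0.8 threshold is likewise illustrative.

```python
def similarity(a, b):
    """Toy matching degree: fraction of aligned positions where strings agree."""
    if not a or not b:
        return 0.0
    hits = sum(x == y for x, y in zip(a, b))
    return hits / max(len(a), len(b))

def identify_dialect(voice_wake_word, dialects, recognize,
                     reference_texts, required=0.8):
    """Try each dialect's ASR model in turn until the recognized text
    matches that dialect's reference text wake-up word well enough."""
    for dialect in dialects:
        text = recognize(dialect, voice_wake_word)
        if similarity(text, reference_texts[dialect]) >= required:
            return dialect
    return None  # no dialect's model produced a matching transcript
```

The intuition behind the loop is that only the ASR model of the correct dialect is likely to transcribe the wake-up word into text close to that dialect's reference text wake-up word.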
It should be noted that the first setting requirement, the second setting requirement, and the third setting requirement may be the same or different.
In some exemplary embodiments, if the terminal device 102 is a device with a display screen, such as a mobile phone, computer, or wearable device, a voice input interface may be displayed on the screen, through which text information and/or speech signals input by the user are obtained. Optionally, when the user needs speech recognition, the user may send an activation instruction to the terminal device 102, for example by pressing its power button or touching its display screen. In response to the instruction to activate or turn itself on, the terminal device 102 may present the voice input interface to the user on the display screen. Optionally, the voice input interface may show a microphone icon or text such as "enter wake-up word" to prompt the user to input the voice wake-up word. The terminal device 102 can then obtain the voice wake-up word input by the user through the voice input interface.
In some exemplary embodiments, the terminal device 102 may be a device with an audio playback function, such as a mobile phone, computer, or smart speaker. In that case, after sending the service request to the server 101 and before sending the to-be-recognized voice signal, the terminal device 102 may output a voice input prompt, such as "please speak" or "please make a request", to prompt the user for voice input. After inputting the voice wake-up word and upon hearing the prompt tone, the user can input the to-be-recognized voice signal to the terminal device 102. The terminal device 102 receives the to-be-recognized voice signal input by the user and sends it to the server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
In other exemplary embodiments, the terminal device 102 may be a device with a display screen, such as a mobile phone, computer, or wearable device. In that case, after sending the service request to the server 101 and before sending the to-be-recognized voice signal, the terminal device 102 may display voice input prompt information as text or icons, for example text such as "please speak" or a microphone icon, to prompt the user for voice input. After inputting the voice wake-up word, the user can input the to-be-recognized voice signal to the terminal device 102 in response to the prompt. The terminal device 102 receives the to-be-recognized voice signal input by the user and sends it to the server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
在又一些示例性实施例中,终端设备102可以具有指示灯。基于此,终端设备102在向服务器101发送服务请求之后,并且在向服务器101发送待识别语音信号之前,可以点亮指示灯,以提示用户进行语音输入。对用户来说,在输入语音唤醒词之后,可以在该指示灯的提示下,向终端设备102输入待识别语音信号。终端设备102接收用户输入的待识别语音信号,将待识别语音信号发送给服务器101,由服务器101根据第一方言对应的ASR模型对待识别语音信号进行语音识别。In still other exemplary embodiments, the terminal device 102 may have an indicator light. Based on this, after the terminal device 102 transmits the service request to the server 101, and before transmitting the voice signal to be recognized to the server 101, the indicator light can be illuminated to prompt the user to perform voice input. For the user, after inputting the voice wake-up word, the voice signal to be recognized may be input to the terminal device 102 at the prompt of the indicator light. The terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101. The server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
值得说明的是,终端设备102可以同时具备语音播放功能、指示灯、显示屏中的至少两种或者三种。基于此,终端设备102可同时以音频方式、以文本或者图标方式以及点亮指示灯的方式中的两种或三种,输出语音输入提示信息,从而加强与用户的互动效果。It should be noted that the terminal device 102 can simultaneously have at least two or three of a voice playing function, an indicator light, and a display screen. Based on this, the terminal device 102 can simultaneously output the voice input prompt information in two or three of an audio manner, a text or an icon manner, and a manner of lighting the indicator light, thereby enhancing the interaction effect with the user.
In some exemplary embodiments, before outputting the voice input prompt tone, outputting the voice input prompt information, or lighting the indicator, the terminal device 102 may first confirm that the server 101 has selected the ASR model corresponding to the first dialect, so that once the to-be-recognized voice signal input by the user is sent to the server 101, the server 101 can directly recognize it with the selected ASR model. To this end, after selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 101 returns a notification message to the terminal device 102 indicating that the ASR model corresponding to the first dialect has been selected. The terminal device 102 receives this notification message and thereby learns that the server 101 has selected the ASR model corresponding to the first dialect. After receiving the notification message returned by the server 101, the terminal device 102 may then output the voice input prompt tone, output the voice input prompt information, or light the indicator to prompt the user to perform voice input.
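As an illustrative sketch only (the application does not specify a wire format; the JSON message types and field names below are hypothetical), the request/notification exchange described above might look like:

```python
import json

def build_service_request(dialect: str) -> str:
    # Hypothetical service request asking the server to select the
    # ASR model for the identified first dialect.
    return json.dumps({"type": "service_request", "dialect": dialect})

def handle_on_server(message: str) -> str:
    # Server side: select the ASR model for the requested dialect and
    # return a notification indicating the selection has been made.
    req = json.loads(message)
    selected = req["dialect"]  # a real server would look up its model registry here
    return json.dumps({"type": "notification", "selected_dialect": selected})

def terminal_should_prompt(notification: str) -> bool:
    # Terminal side: only output the voice-input prompt (tone, text/icon,
    # or indicator light) once the notification has been received.
    note = json.loads(notification)
    return note.get("type") == "notification"

request = build_service_request("Cantonese")
notification = handle_on_server(request)
print(terminal_should_prompt(notification))  # True -> prompt the user to speak
```

The point of the exchange is ordering: the prompt is withheld until the notification arrives, guaranteeing the server is ready to recognize the signal that follows.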
In the embodiments of the present application, before selecting the ASR model corresponding to the first dialect, the server 101 needs to build ASR models corresponding to different dialects. The building process mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features. For the detailed procedure of building the ASR model for each dialect, reference may be made to the prior art, and details are not repeated here.
Optionally, the corpora of the different dialects may be collected over the network, or voice recordings may be made of a large number of users who speak the different dialects.
Optionally, before feature extraction, the collected corpora of the different dialects may be preprocessed. The preprocessing includes pre-emphasis, windowing, and endpoint detection of the speech. After preprocessing, features can be extracted from the speech. Speech features include time-domain features and frequency-domain features: the time-domain features include short-time average energy, short-time average zero-crossing rate, formants, and pitch period; the frequency-domain features include linear prediction coefficients, LPC cepstral coefficients, line spectrum pair parameters, the short-time spectrum, and Mel-frequency cepstral coefficients.
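The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration, not the application's implementation; the frame length, hop size, and pre-emphasis coefficient are assumed values, and the short-time energy shown is one of the time-domain features listed (a simple energy threshold on it can serve as a crude endpoint detector):

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               alpha: float = 0.97) -> np.ndarray:
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing plus a Hamming window (the windowing step).
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    # Short-time average energy per frame; thresholding it gives a
    # rudimentary endpoint (speech/non-speech) decision.
    return (frames ** 2).mean(axis=1)

rng = np.random.default_rng(0)
frames = preprocess(rng.standard_normal(16000))  # 1 s of synthetic 16 kHz audio
energy = short_time_energy(frames)
print(frames.shape, energy.shape)  # (98, 400) (98,)
```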
Taking the Mel-frequency cepstral coefficients as an example, the acoustic feature extraction process is as follows. First, exploiting the perceptual characteristics of the human ear, several band-pass filters, each with a triangular or sinusoidal filter characteristic, are placed across the spectral range of the speech. Energy information is then incorporated into the feature vector obtained by filtering the corpus through these band-pass filters: the signal energy within each band-pass filter is computed, and the Mel-frequency cepstral coefficients are obtained via a discrete cosine transform.
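A minimal sketch of that pipeline (triangular mel-spaced band-pass filters, log filter energies, then a DCT) is given below. The filter count, FFT size, and sampling rate are assumed values chosen for illustration; a production front end would differ in detail:

```python
import numpy as np

def mel_filterbank(n_filters: int = 26, n_fft: int = 512, sr: int = 16000):
    # Triangular band-pass filters spaced on the mel scale, reflecting
    # the perceptual characteristics of the human ear.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(frame: np.ndarray, n_ceps: int = 13, n_fft: int = 512, sr: int = 16000):
    # Log energy in each band-pass filter, then a DCT-II (built here as an
    # explicit matrix) yields the Mel-frequency cepstral coefficients.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    energies = np.log(mel_filterbank(n_fft=n_fft, sr=sr) @ power + 1e-10)
    n = len(energies)
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                 * np.arange(n_ceps)[:, None])
    return dct @ energies

rng = np.random.default_rng(1)
coeffs = mfcc(np.hamming(400) * rng.standard_normal(400))
print(coeffs.shape)  # (13,) -- 13 cepstral coefficients for this frame
```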
After the acoustic features of the different dialects are obtained, the parameters of the initial model for each dialect are trained with that dialect's acoustic features as input and the text corresponding to its corpus as output, yielding the ASR model corresponding to each dialect. Optionally, the ASR model includes, but is not limited to, a model built with vector quantization, a neural network model, and the like.
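As a toy illustration of the vector-quantization option mentioned above (not the application's actual training procedure), one can train a small k-means codebook per dialect on its acoustic feature vectors; here the codebooks are used only to pick the best-matching dialect by quantization error, standing in for the full feature-to-text mapping of a real ASR model. All data and dialect names below are synthetic:

```python
import numpy as np

def train_codebook(features: np.ndarray, k: int = 4, iters: int = 20,
                   seed: int = 0) -> np.ndarray:
    # Minimal vector-quantization training (k-means): the codebook
    # summarizes the distribution of a dialect's acoustic feature vectors.
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((features[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def classify_dialect(features: np.ndarray, codebooks: dict) -> str:
    # Score a feature sequence against each dialect's codebook by total
    # quantization error; the lowest-error dialect wins.
    def err(cb):
        return ((features[:, None] - cb) ** 2).sum(-1).min(axis=1).sum()
    return min(codebooks, key=lambda d: err(codebooks[d]))

rng = np.random.default_rng(2)
corpus = {"Cantonese": rng.normal(0.0, 1.0, (200, 13)),
          "Tibetan": rng.normal(3.0, 1.0, (200, 13))}
codebooks = {d: train_codebook(f) for d, f in corpus.items()}
sample = rng.normal(3.0, 1.0, (50, 13))  # unseen Tibetan-like features
print(classify_dialect(sample, codebooks))  # Tibetan
```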
The above embodiments are described in detail below, taking as an example an application scenario in which several users who speak different dialects use the same terminal device to order songs.
The terminal device with the song-ordering function may be a smart speaker. Optionally, the smart speaker has a display screen, and its preset voice wake-up word is "hello". When a Cantonese-speaking user wants to order a song, the user first touches the display screen to input an instruction that activates the smart speaker. In response, the smart speaker displays a voice input interface on the screen, on which the text "hello" is shown. The Cantonese user speaks "hello" into the voice input interface. Through the interface, the smart speaker captures the "hello" voice signal and recognizes that it belongs to the Cantonese dialect; it then sends a service request asking the server to select the ASR model corresponding to Cantonese from the ASR models corresponding to different dialects. After receiving the service request, the server selects the ASR model corresponding to Cantonese and returns a notification message to the smart speaker indicating that this model has been selected. The smart speaker then outputs voice input prompt information, such as "please input voice", to prompt the user to speak.
Guided by the prompt, the Cantonese user speaks the song title "Five-Star Red Flag". The smart speaker receives this voice signal and sends it to the server. Using the ASR model corresponding to Cantonese, the server recognizes the voice signal, obtains the text "Five-Star Red Flag", and delivers a matching song to the smart speaker for playback.
Similarly, suppose that after the Cantonese user has finished, a Tibetan-speaking user wants to order a song. The Tibetan user speaks "hello" into the voice input interface displayed by the smart speaker. The smart speaker recognizes that this "hello" belongs to the Tibetan dialect and sends a service request asking the server to select the ASR model corresponding to Tibetan from the ASR models corresponding to different dialects. After receiving the service request, the server selects the ASR model corresponding to Tibetan and returns a notification message to the smart speaker indicating that this model has been selected. The smart speaker then outputs voice input prompt information, such as "please input voice", to prompt the user to speak. Guided by the prompt, the Tibetan user speaks the song title "My Motherland". The smart speaker receives this voice signal and sends it to the server. Using the ASR model corresponding to Tibetan, the server recognizes the voice signal, obtains the text "My Motherland", and delivers a matching song to the smart speaker for playback.
In this application scenario, with the speech recognition method provided by the embodiments of the present application, users who speak different dialects can order songs on the same smart speaker without manually switching ASR models: each user simply speaks the voice wake-up word in his or her own dialect. The smart speaker automatically identifies the dialect of the wake-up word and asks the server to activate the ASR model corresponding to that dialect to recognize the requested song title. This supports automated multi-dialect song ordering while improving its efficiency.
FIG. 2 is a schematic flowchart of a speech recognition method provided by another exemplary embodiment of the present application. This embodiment may be implemented based on the speech recognition system shown in FIG. 1 and is described mainly from the perspective of the terminal device. As shown in FIG. 2, the method includes:
21. Receive a voice wake-up word.
22. Identify the first dialect to which the voice wake-up word belongs.
23. Send a service request to the server, requesting the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
24. Send the to-be-recognized voice signal to the server, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.
When the user wants to perform speech recognition, a voice wake-up word may be input to the terminal device. The voice wake-up word is a voice signal with specified text content, such as "turn on", "Tmall Genie", or "hello". The terminal device receives the voice wake-up word input by the user and identifies the dialect to which it belongs. This determines the dialect of the subsequent to-be-recognized voice signal (namely, the dialect of the wake-up word) and provides the basis for performing speech recognition with the ASR model corresponding to that dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
After identifying the first dialect, the terminal device sends a service request to the server instructing it to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. The terminal device then sends the to-be-recognized voice signal to the server. After receiving the service request, the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, and uses the selected model to recognize the received to-be-recognized voice signal.
In this embodiment, the terminal device identifies the first dialect to which the voice wake-up word belongs and sends a service request so that the server selects the ASR model corresponding to the first dialect for recognizing the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition. Further, because the voice wake-up word is short, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect, select the corresponding ASR model, and further improve recognition efficiency.
In some exemplary embodiments, the first dialect to which the voice wake-up word belongs may be identified in one of the following ways. In one way, the voice wake-up word is dynamically matched, in terms of acoustic features, against reference wake-up words recorded in different dialects, and the dialect of the reference wake-up word whose matching degree meets a first set requirement is taken as the first dialect. In another way, the acoustic features of the voice wake-up word are matched against the acoustic features of different dialects, and the dialect whose matching degree meets a second set requirement is taken as the first dialect. In yet another way, the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect of the reference text wake-up word whose matching degree meets a third set requirement is taken as the first dialect.
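The first way above, dynamic matching of acoustic features against reference wake-up words recorded in each dialect, is commonly realized with dynamic time warping (DTW), which tolerates differences in speaking rate. A minimal sketch, assuming MFCC-like feature sequences and hypothetical dialect names (the application does not prescribe DTW specifically):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Dynamic time warping between two acoustic feature sequences:
    # the classic cumulative-cost recursion over a (n+1) x (m+1) grid.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])

def identify_dialect(wake_word: np.ndarray, references: dict) -> str:
    # The dialect whose reference wake-up word matches best (lowest DTW
    # distance) is taken as the first dialect.
    return min(references, key=lambda d: dtw_distance(wake_word, references[d]))

rng = np.random.default_rng(3)
ref_a = rng.standard_normal((20, 13))
references = {"dialect_A": ref_a, "dialect_B": rng.standard_normal((25, 13))}
spoken = ref_a + 0.01 * rng.standard_normal((20, 13))  # noisy dialect_A utterance
print(identify_dialect(spoken, references))  # dialect_A
```

In practice a set requirement (the "first set requirement") would also impose a maximum acceptable distance, so that an out-of-vocabulary utterance matches no dialect at all.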
In some exemplary embodiments, receiving the voice wake-up word includes: in response to an instruction to activate or turn on the terminal device, presenting a voice input interface to the user; and acquiring, through the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before the to-be-recognized voice signal is sent to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the to-be-recognized voice signal input by the user.
In some exemplary embodiments, before the voice input prompt information is output, the method further includes: receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
FIG. 3 is a schematic flowchart of another speech recognition method provided by yet another exemplary embodiment of the present application. This embodiment may be implemented based on the speech recognition system shown in FIG. 1 and is described mainly from the perspective of the server. As shown in FIG. 3, the method includes:
31. Receive a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect.
32. From the ASR models corresponding to different dialects, select the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs.
33. Receive the to-be-recognized voice signal sent by the terminal device, and perform speech recognition on it with the ASR model corresponding to the first dialect.
In this embodiment, after identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server. According to the service request, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects, and then performs speech recognition on the subsequent voice signal based on that model. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect, the server needs to build ASR models corresponding to different dialects. The building process mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features.
In some exemplary embodiments, after performing speech recognition on the to-be-recognized voice signal with the ASR model corresponding to the first dialect, the server may send the recognition result, or information associated with the recognition result, to the terminal device, so that the terminal device can perform subsequent processing based on the recognition result or its associated information.
FIG. 4 is a schematic structural diagram of another speech recognition system provided by yet another exemplary embodiment of the present application. As shown in FIG. 4, the speech recognition system 400 includes a server 401 and a terminal device 402, which are communicatively connected.
The architecture of the speech recognition system 400 provided by this embodiment is the same as that of the speech recognition system 100 shown in FIG. 1; the difference lies in the functions performed by the server 401 and the terminal device 402 during speech recognition. For the implementation forms of the terminal device 402 and the server 401 and the manner of their communication connection, reference may be made to the description of the embodiment shown in FIG. 1, and details are not repeated here.
Similar to the speech recognition system 100 shown in FIG. 1, in the speech recognition system 400 the terminal device 402 and the server 401 cooperate to provide the user with a speech recognition function. Moreover, considering that in some cases the terminal device 402 may be used by multiple users who speak different dialects, the speech recognition system 400 likewise builds a separate ASR model for each dialect. Through the cooperation of the terminal device 402 and the server 401, speech recognition can thus be provided for users speaking different dialects, that is, the voice signals of users speaking different dialects can all be recognized.
In the speech recognition system 400 shown in FIG. 4, the terminal device 402 also supports the voice wake-up word function, but unlike the terminal device 102 in the embodiment shown in FIG. 1, it mainly receives the voice wake-up word input by the user and reports it to the server 401, which identifies the dialect to which the wake-up word belongs. Correspondingly, in the speech recognition system 400, the server 401 not only provides ASR models for different dialects and selects the appropriate model to recognize voice signals in the corresponding dialect, but also performs the identification of the dialect to which the voice wake-up word belongs.
Based on the speech recognition system 400 shown in FIG. 4, when the user wants to perform speech recognition, a voice wake-up word may be input to the terminal device 402. The voice wake-up word is a voice signal with specified text content, such as "turn on", "Tmall Genie", or "hello". The terminal device 402 receives the voice wake-up word input by the user and sends it to the server 401. After receiving the voice wake-up word, the server 401 identifies the dialect to which it belongs. For ease of description and distinction, this dialect is denoted the first dialect; it may be, for example, a Mandarin dialect, Jin, or Xiang. The server 401 then selects, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, so that voice signals in the first dialect can subsequently be recognized with that model. In this embodiment, the server 401 pre-stores the ASR models corresponding to different dialects. Optionally, each dialect may correspond to its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert voice signals in the first dialect into text content.
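A minimal sketch of the server-side selection step, assuming a hypothetical in-memory registry keyed by dialect name (the application does not specify how the server 401 stores or indexes its models, and the `AsrModel` class below is a placeholder):

```python
class AsrModel:
    """Placeholder for a dialect-specific ASR model."""

    def __init__(self, dialect: str):
        self.dialect = dialect

    def recognize(self, speech_signal: bytes) -> str:
        # A real model converts the dialect speech signal to text content.
        return f"<text recognized with {self.dialect} model>"

# Pre-stored models, one per dialect; several similar dialects could
# instead map to the same AsrModel instance.
MODEL_REGISTRY = {
    "Cantonese": AsrModel("Cantonese"),
    "Tibetan": AsrModel("Tibetan"),
    "Mandarin": AsrModel("Mandarin"),
}

def select_model(first_dialect: str) -> AsrModel:
    # Selection step performed by the server 401 after it identifies the
    # first dialect from the voice wake-up word.
    return MODEL_REGISTRY[first_dialect]

model = select_model("Cantonese")
print(model.recognize(b"..."))
```

Mapping several similar dialects to one shared entry is just a matter of pointing multiple registry keys at the same model instance.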
After sending the voice wake-up word to the server 401, the terminal device 402 goes on to send the to-be-recognized voice signal to the server 401. The server 401 receives the to-be-recognized voice signal and performs speech recognition on it with the ASR model corresponding to the first dialect. Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device 402 after inputting the voice wake-up word; in that case, before sending it to the server 401, the terminal device 402 also receives the to-be-recognized voice signal input by the user. Alternatively, the to-be-recognized voice signal may be a voice signal pre-recorded and stored locally on the terminal device 402.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected model is used to recognize the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, the server 401 identifies the first dialect to which the voice wake-up word belongs by dynamically matching the voice wake-up word, in terms of acoustic features, against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect of the reference wake-up word whose matching degree meets a first set requirement.
In other exemplary embodiments, the server 401 matches the acoustic features of the voice wake-up word against the acoustic features of different dialects, and takes as the first dialect the dialect whose matching degree meets a second set requirement.
In still other exemplary embodiments, the server 401 converts the voice wake-up word into a text wake-up word, matches the text wake-up word against the reference text wake-up words corresponding to different dialects, and takes as the first dialect the dialect of the reference text wake-up word whose matching degree meets a third set requirement.
The manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs is similar to the manner in which the terminal device 102 does so; for a detailed description, reference may be made to the foregoing embodiments, and details are not repeated here.
在一些示例性实施例中,终端设备402接收语音唤醒词的方式包括:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In some exemplary embodiments, the manner in which the terminal device 402 receives the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; acquiring a voice wake-up word input by the user based on the voice input interface.
在一些示例性实施例中,终端设备402在向服务器401发送待识别语音信号之前,可以输出语音输入提示信息,以提示用户进行语音输入;之后,接收用户输入的待识别语音信号。In some exemplary embodiments, before transmitting the to-be-identified voice signal to the server 401, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input; and thereafter, receive the voice signal to be recognized input by the user.
在一些示例性实施例中,终端设备402在输出语音输入提示信息之前,可以接收服务器401返回的通知消息,该通知消息用于指示已选择第一方言对应的ASR模型。基于此,终端设备402可以在确定服务器401已选择第一方言对应的ASR模型之后,向用户输出语音输入提示信息,以提示用户进行语音输入,这样在将用户输入的待识别语音信号发送至服务器401后,服务器401可以直接根据已选择的ASR模型对待识别语音信号进行识别。In some exemplary embodiments, before outputting the voice input prompt information, the terminal device 402 may receive a notification message returned by the server 401 indicating that the ASR model corresponding to the first dialect has been selected. Based on this, after determining that the server 401 has selected the ASR model corresponding to the first dialect, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input, so that once the speech signal to be recognized input by the user is sent to the server 401, the server 401 can directly recognize it with the selected ASR model.
在一些示例性实施例中,服务器401在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,可以收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。关于构建每种方言对应的ASR模型的详细过程可参见现有技术,在此不再赘述。In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 401 may collect corpora of different dialects, perform feature extraction on the corpora to obtain the acoustic features of the different dialects, and construct the ASR model corresponding to each dialect according to its acoustic features. For the detailed process of constructing the ASR model corresponding to each dialect, refer to the prior art; details are not described herein again.
在一些示例性实施例中,服务器401可以向终端设备402返回语音识别结果或语音识别结果的关联信息。例如,服务器401可以将语音识别出的文本内容返回给终端设备402;或者,服务器401也可以将与语音识别结果相匹配的歌曲、视频等信息返回给终端设备402。终端设备402接收服务器401返回的语音识别结果或语音识别结果的关联信息,并根据语音识别结果或语音识别结果的关联信息执行后续处理。In some exemplary embodiments, the server 401 may return the speech recognition result, or information associated with it, to the terminal device 402. For example, the server 401 may return the recognized text content to the terminal device 402, or return information such as songs or videos that match the speech recognition result. The terminal device 402 receives the speech recognition result or its associated information returned by the server 401 and performs subsequent processing accordingly.
图5为本申请又一示例性实施例提供的又一种语音识别方法的流程示意图。该实施例可基于图4所示语音识别系统实现,主要是从终端设备的角度进行的描述。如图5所示,该方法包括:FIG. 5 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in FIG. 4, mainly from the perspective of the terminal device. As shown in FIG. 5, the method includes:
51、接收语音唤醒词。51. Receive a speech wake up word.
52、向服务器发送语音唤醒词,以供服务器基于语音唤醒词从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型。52. Send a voice wake-up word to the server, so that the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR model corresponding to different dialects based on the voice wake-up word.
53、向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。53. Send a voice signal to be identified to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
当用户想要进行语音识别时,可以向终端设备输入语音唤醒词,该语音唤醒词是指定文本内容的语音信号,例如“开启”、“天猫精灵”、“hello”等。终端设备接收用户发送的语音唤醒词,并向服务器发送语音唤醒词,以供服务器识别该语音唤醒词所属的方言,进而可确定后续待识别语音信号所属的方言(即该语音唤醒词所属的方言),为采用相应方言对应的ASR模型进行语音识别提供基础。为便于描述和区分,将语音唤醒词所属的方言记为第一方言。When the user wants to perform speech recognition, a voice wake-up word may be input to the terminal device; the voice wake-up word is a voice signal with specified text content, such as "on", "Tmall Elf", or "hello". The terminal device receives the voice wake-up word sent by the user and forwards it to the server, so that the server can identify the dialect to which the voice wake-up word belongs and thereby determine the dialect of the subsequent speech signal to be recognized (i.e., the dialect to which the voice wake-up word belongs), providing a basis for speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
然后,服务器根据语音唤醒词所属的第一方言,从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型。接着,终端设备继续向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。Then, the server selects, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect to which the voice wake-up word belongs. Next, the terminal device continues to send the speech signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
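The terminal-side exchange in steps 51-53 can be summarized in a short sketch. The `send_to_server` transport callable and the message dictionaries are assumptions made purely for illustration, since this application does not prescribe a wire format.

```python
# Hypothetical terminal-side flow for steps 51-53.
# `send_to_server` is an injected transport callable.

def terminal_flow(wake_word_audio, speech_audio, send_to_server):
    # Steps 51/52: receive the wake word and forward it so the server can
    # identify its dialect and select the matching ASR model.
    send_to_server({"type": "wake_word", "audio": wake_word_audio})
    # Step 53: forward the speech signal to be recognized; the server will
    # recognize it with the ASR model selected for the first dialect.
    send_to_server({"type": "speech", "audio": speech_audio})
```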
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, the ASR model corresponding to that dialect is selected from the ASR models of the different dialects, and the selected model is used to recognize the subsequent speech signal to be recognized. This automates multi-dialect speech recognition, and the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word without any manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
在一些示例性实施例中,上述接收语音唤醒词包括:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In some exemplary embodiments, the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
在一些示例性实施例中,在向服务器发送待识别语音信号之前,该方法还包括:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In some exemplary embodiments, before transmitting the to-be-identified voice signal to the server, the method further includes: outputting the voice input prompt information to prompt the user to perform voice input; and receiving the voice signal to be recognized input by the user.
在一些示例性实施例中,在输出语音输入提示信息之前,该方法还包括:接收服务器返回的通知消息,通知消息用于指示已选择第一方言对应的ASR模型。In some exemplary embodiments, before outputting the voice input prompt information, the method further includes: receiving a notification message returned by the server, the notification message being used to indicate that the ASR model corresponding to the first dialect has been selected.
图6为本申请又一示例性实施例提供的又一种语音识别方法的流程示意图。该实施例可基于图4所示语音识别系统实现,主要是从服务器的角度进行的描述。如图6所示,该方法包括:FIG. 6 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 4, mainly from the perspective of the server. As shown in FIG. 6, the method includes:
61、接收终端设备发送的语音唤醒词。61. Receive a voice wake-up word sent by the terminal device.
62、识别语音唤醒词所属的第一方言。62. Identify a first dialect to which the voice wake-up word belongs.
63、从不同方言对应的ASR模型中,选择第一方言对应的ASR模型。63. Select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
64、接收终端设备发送的待识别语音信号,并利用第一方言对应的ASR模型对待识别语音信号进行语音识别。64. Receive a voice signal to be recognized sent by the terminal device, and perform voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
服务器接收终端设备发送的语音唤醒词,识别该语音唤醒词所属的方言,进而可确定后续待识别语音信号所属的方言(即该语音唤醒词所属的方言),为采用相应方言对应的ASR模型进行语音识别提供基础。为便于描述和区分,将语音唤醒词所属的方言记为第一方言。The server receives the voice wake-up word sent by the terminal device and identifies the dialect to which it belongs, thereby determining the dialect of the subsequent speech signal to be recognized (i.e., the dialect to which the voice wake-up word belongs), which provides a basis for speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
然后,服务器从预先存储的不同方言对应的ASR模型中,选择第一方言对应的ASR模型,进而可基于第一方言对应的ASR模型为后续语音信号进行语音识别,实现了多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。Then, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models of different dialects, and can then perform speech recognition on subsequent speech signals based on that model. This automates multi-dialect speech recognition, and the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word without any manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
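The server-side flow of steps 61-64 can be sketched as a small stateful handler. The injected `identify_dialect` function and the per-dialect recognizer callables are placeholders for the components described in the text, not a prescribed implementation.

```python
class DialectASRServer:
    """Sketch of steps 61-64: identify the wake word's dialect, select that
    dialect's ASR model, then recognize subsequent speech with it."""

    def __init__(self, asr_models, identify_dialect):
        self.asr_models = asr_models          # dialect name -> recognizer callable
        self.identify_dialect = identify_dialect
        self.selected_model = None

    def on_wake_word(self, wake_word_audio):
        # Steps 61-62: receive the wake word and identify its dialect.
        dialect = self.identify_dialect(wake_word_audio)
        # Step 63: select the ASR model corresponding to the first dialect.
        self.selected_model = self.asr_models[dialect]
        return dialect

    def on_speech(self, speech_audio):
        # Step 64: recognize the speech signal with the selected model.
        if self.selected_model is None:
            raise RuntimeError("no wake word received yet")
        return self.selected_model(speech_audio)
```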
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高多方言语音识别的效率。Further, since the voice wake-up word is relatively short, the process of recognizing the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
在一些示例性实施例中,上述识别语音唤醒词所属的第一方言的一种方式包括:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言。In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
在另一些示例性实施例中,上述识别语音唤醒词所属的第一方言的另一种方式包括:将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言。In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second setting requirement.
在又一些示例性实施例中,上述识别语音唤醒词所属的第一方言的又一种方式包括:将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement.
在一些示例性实施例中,在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,该方法还包括:收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and constructing the ASR model corresponding to each dialect according to its acoustic features.
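The three-stage preparation described above (collect corpora, extract acoustic features, build one ASR model per dialect) can be outlined as below. The "feature extraction" and the "model" here are deliberately toy stand-ins (frame energies and their mean), since the text defers the real construction process to the prior art; an actual system would extract features such as MFCCs and train full acoustic models.

```python
# Toy sketch of the per-dialect training pipeline: corpora in, one "model"
# per dialect out. All function names and data shapes are illustrative.

def extract_features(utterance):
    """Placeholder feature extraction: mean energy per frame.
    `utterance` is assumed to be a list of frames (lists of samples)."""
    return [sum(frame) / len(frame) for frame in utterance]

def build_asr_models(corpora):
    """corpora: dialect name -> list of utterances.
    Returns a dialect -> model mapping (here, just a mean feature value)."""
    models = {}
    for dialect, utterances in corpora.items():
        feats = [f for u in utterances for f in extract_features(u)]
        models[dialect] = sum(feats) / len(feats)
    return models
```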
在一些示例性实施例中,服务器可以向终端设备返回语音识别结果或语音识别结果的关联信息。例如,服务器可以将语音识别出的文本内容返回给终端设备;或者,也可以将与语音识别结果相匹配的歌曲、视频等信息返回给终端设备。In some exemplary embodiments, the server may return the speech recognition result, or information associated with it, to the terminal device. For example, the server may return the recognized text content to the terminal device, or return information such as songs or videos that match the speech recognition result.
在上述各实施例中,由终端设备和服务器配合执行多方言的语音识别,但并不限于此。例如,若终端设备或者服务器的处理功能与存储功能足够强大,则可将多方言语音识别功能单独集成于终端设备或者服务器上实现。基于此,本申请又一示例性实施例提供一种由服务器或终端设备独立实施的语音识别方法。为了描述简便,在下述实施例中,将服务器和终端设备统一称为电子设备。如图7所示,由服务器或终端设备独立实施的语音识别方法包括以下步骤:In the above embodiments, multi-dialect speech recognition is performed by the terminal device and the server in cooperation, but the application is not limited thereto. For example, if the processing and storage capabilities of the terminal device or the server are sufficiently powerful, the multi-dialect speech recognition function can be integrated on the terminal device or the server alone. Based on this, still another exemplary embodiment of the present application provides a speech recognition method implemented independently by a server or a terminal device. For simplicity of description, in the following embodiments, the server and the terminal device are collectively referred to as an electronic device. As shown in FIG. 7, the speech recognition method implemented independently by the server or the terminal device includes the following steps:
71、接收语音唤醒词。71. Receive a speech wake-up word.
72、识别语音唤醒词所属的第一方言。72. Identify a first dialect to which the voice wake-up word belongs.
73、从不同方言对应的ASR模型中选择第一方言对应的ASR模型。73. Select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
74、利用第一方言对应的ASR模型对待识别语音信号进行语音识别。74. Perform speech recognition by using the ASR model corresponding to the first dialect to identify the speech signal.
当用户想要进行语音识别时,可以向电子设备输入语音唤醒词,该语音唤醒词是指定文本内容的语音信号,例如“开启”、“天猫精灵”、“hello”等。电子设备接收用户发送的语音唤醒词,并识别语音唤醒词所属的第一方言。其中,第一方言指语音唤醒词所属的方言,例如官话方言、晋语、湘语等。When the user wants to perform voice recognition, a voice wake-up word may be input to the electronic device, and the voice wake-up word is a voice signal specifying the text content, such as "on", "Tmall Elf", "hello", and the like. The electronic device receives the voice wake-up word sent by the user, and identifies the first dialect to which the voice wake-up word belongs. Among them, the first dialect refers to the dialect to which the awakening words of speech belong, such as Mandarin dialect, Jin dialect, Xiang dialect and so on.
接着,电子设备从不同方言对应的ASR模型中,选择第一方言对应的ASR模型,以便基于第一方言对应的ASR模型对后续待识别语音信号进行语音识别。在本实施例中,电子设备预先存储有不同方言对应的ASR模型。可选地,一种方言对应一个ASR模型,或者几种类似的方言也可以对应同一ASR模型,对此不做限定。其中,第一方言对应的ASR模型用于将第一方言的语音信号转换为文本内容。Next, the electronic device selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform speech recognition on the subsequent speech signal to be recognized based on that model. In this embodiment, the electronic device stores the ASR models corresponding to different dialects in advance. Optionally, each dialect may correspond to its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert a speech signal in the first dialect into text content.
电子设备在选择第一方言对应的ASR模型后,会利用第一方言对应的ASR模型对待识别语音信号进行语音识别。可选地,待识别语音信号可以是用户在输入语音唤醒词后,继续向电子设备输入的语音信号,基于此,电子设备在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,还可以接收用户输入的待识别语音信号。或者,待识别语音信号也可以是预先录制并存储在电子设备本地的语音信号,基于此,电子设备可以直接从本地获取待识别语音信号。After selecting the ASR model corresponding to the first dialect, the electronic device performs speech recognition on the speech signal to be recognized using that model. Optionally, the speech signal to be recognized may be a speech signal that the user continues to input to the electronic device after inputting the voice wake-up word; in that case, before performing speech recognition with the ASR model corresponding to the first dialect, the electronic device may further receive the speech signal to be recognized input by the user. Alternatively, the speech signal to be recognized may be a speech signal pre-recorded and stored locally on the electronic device, in which case the electronic device may obtain it directly from local storage.
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, the ASR model corresponding to that dialect is selected from the ASR models of the different dialects, and the selected model is used to recognize the subsequent speech signal to be recognized. This automates multi-dialect speech recognition, and the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word without any manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高多方言语音识别的效率。Further, since the voice wake-up word is relatively short, the process of recognizing the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
在一些示例性实施例中,上述识别语音唤醒词所属的第一方言的一种方式包括:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言。In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
在另一些示例性实施例中,上述识别语音唤醒词所属的第一方言的另一种方式包括:将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言。In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second setting requirement.
在又一些示例性实施例中,上述识别语音唤醒词所属的第一方言的又一种方式包括:将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement.
在一些示例性实施例中,上述接收语音唤醒词,包括:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In some exemplary embodiments, the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
在一些示例性实施例中,在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,该方法还包括:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In some exemplary embodiments, before performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the speech signal to be recognized input by the user.
在一些示例性实施例中,在从不同方言对应的ASR模型中选择第一方言对应的ASR模型之前,该方法还包括:收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and constructing the ASR model corresponding to each dialect according to its acoustic features.
在一些示例性实施例中,在基于第一方言对应的ASR模型对待识别语音信号进行语音识别之后,电子设备可以基于语音识别结果或语音识别结果的关联信息执行后续处理。In some exemplary embodiments, after performing speech recognition on the speech signal to be recognized based on the ASR model corresponding to the first dialect, the electronic device may perform subsequent processing based on the speech recognition result or the association information of the speech recognition result.
值得说明的是,在本申请上述实施例或下述实施例中,语音唤醒词可以是预置的;或者,也可以允许用户自定义唤醒词。这里自定义唤醒词或预置唤醒词主要是指唤醒词的内容和/或声调等。其中,自定义语音唤醒词的功能可由终端设备来实现,也可以由服务器来实现。可选地,可由识别语音唤醒词所属方言的设备提供自定义语音唤醒词的功能。It should be noted that, in the above embodiment or the following embodiments of the present application, the voice wake-up word may be preset; or, the user may be allowed to customize the wake-up word. Here, the custom wake-up word or the preset wake-up word mainly refers to the content and/or tone of the wake-up word. The function of the custom voice wake-up word can be implemented by the terminal device or by the server. Alternatively, the function of the custom speech wake-up word may be provided by a device that recognizes the dialect to which the speech wake-up word belongs.
以终端设备提供自定义唤醒词的功能为例,终端设备可以向用户提供一种自定义唤醒词的入口。该入口可以实现为一物理按钮,基于此,用户可以点击该物理按钮触发唤醒词自定义操作。或者,该入口可以是终端设备的设置选项中的唤醒词自定义子项,基于此,用户可以进入终端设备的设置选项,然后针对该唤醒词自定义子项进行点击、悬停或长按等操作,从而触发唤醒词自定义操作。无论用户通过何种方式触发唤醒词自定义操作,对终端设备来说,可响应于唤醒词自定义操作,接收用户输入的自定义语音信号,并将接收到的自定义语音信号保存为语音唤醒词。可选地,终端设备可以向用户展示一音频录入页面,以录制用户发出的自定义语音信号。例如,用户在触发唤醒词自定义操作后,终端设备向用户展示音频录入页面,此时,用户可以输入语音信号“你好”,则终端设备接收到语音信号“你好”后会将语音信号“你好”设置为语音唤醒词。可选地,终端设备可以维护一唤醒词库,将用户自定义的语音唤醒词保存至唤醒词库中。Taking the case where the terminal device provides the custom wake-up word function as an example, the terminal device may provide the user with an entry for customizing the wake-up word. The entry may be implemented as a physical button, in which case the user can press the button to trigger the wake-up word customization operation. Alternatively, the entry may be a wake-up word customization sub-item in the settings of the terminal device, in which case the user can open the settings and click, hover over, or long-press that sub-item to trigger the operation. Regardless of how the user triggers the wake-up word customization operation, the terminal device can, in response to it, receive the custom voice signal input by the user and save it as the voice wake-up word. Optionally, the terminal device may present an audio recording page to the user to record the custom voice signal uttered by the user. For example, after the user triggers the wake-up word customization operation, the terminal device presents the audio recording page; the user may then input the voice signal "你好", and upon receiving it the terminal device sets "你好" as the voice wake-up word. Optionally, the terminal device may maintain a wake-up word library and save the user-defined voice wake-up word into it.
可选地,语音唤醒词不宜过长,以降低识别所属方言时的难度,但也不宜过短。语音唤醒词过短,辨识度不高,容易造成误唤醒。例如,语音唤醒词可以在3至5个字符之间,但不限于此。这里的1个字符是指1个汉字,也可以是1个英文字母。Optionally, the voice wake-up word should not be too long, so as to reduce the difficulty of identifying its dialect, but it should not be too short either: a wake-up word that is too short is not distinctive enough and easily causes false wake-ups. For example, the voice wake-up word may be between 3 and 5 characters, but is not limited thereto. One character here means one Chinese character or one English letter.
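The length constraint suggested above (3 to 5 characters, where one character is one Chinese character or one English letter) could be checked as follows; the function name and the bounds-as-defaults are assumptions for illustration.

```python
# Hypothetical validity check for a user-defined wake word, using the
# 3-to-5-character guideline from the text. Python's len() counts each
# Chinese character and each letter as one character.

def is_valid_wake_word(word, min_len=3, max_len=5):
    return min_len <= len(word) <= max_len
```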
可选地,在自定义唤醒词时,可以选择易于区分的词,而不宜选用较为常用的词,以降低应用被误唤醒的几率。Optionally, when customizing the wake-up words, you can select words that are easy to distinguish, and you should not use more common words to reduce the chance that the application will be awakened by mistake.
在本申请另一些实施例中,语音唤醒词主要用于唤醒或激活应用的语音识别功能,可以不限定语音唤醒词所属的方言,即用户可以采用任意方言或普通话来发出语音唤醒词。用户在发出语音唤醒词之后,可以再发出一具有方言指示意义的语音信号,例如该语音信号可以是内容为“天津话”、“河南话”、“启用闽南方言”等的语音信号。然后,可从用户发出的具有方言指示意义的语音信号中解析出需要进行语音识别的方言,进而从不同方言对应的ASR模型中选择与所解析出的方言对应的ASR模型,并基于所选择的ASR模型对后续待识别语音信号进行语音识别。为便于区分和描述,将这里具有方言指示意义的语音信号称为第一语音信号,将从所述第一语音信号中解析出的方言称为第一方言。In other embodiments of the present application, the voice wake-up word is mainly used to wake up or activate the speech recognition function of the application, and the dialect of the voice wake-up word need not be limited; that is, the user may utter the voice wake-up word in any dialect or in Mandarin. After uttering the voice wake-up word, the user may utter a voice signal with dialect-indicating meaning, for example a voice signal whose content is "天津话" (Tianjin dialect), "河南话" (Henan dialect), "启用闽南方言" (enable the Minnan dialect), or the like. The dialect in which speech recognition is required can then be parsed from this voice signal, the ASR model corresponding to the parsed dialect is selected from the ASR models of different dialects, and speech recognition is performed on the subsequent speech signal to be recognized based on the selected model. For ease of distinction and description, the voice signal with dialect-indicating meaning is referred to as the first voice signal, and the dialect parsed from the first voice signal is referred to as the first dialect.
其中,凡是具有方言指示意义的语音信号均可以作为本申请实施例中的第一语音信号。例如,第一语音信号可以是用户以第一方言发出的语音信号,从而可基于第一语音信号的声学特征识别第一方言。或者,第一语音信号可以是包含第一方言的名称的语音信号,例如在语音信号“请启用闽南话模型”中,“闽南话”即为第一方言的名称。基于此,可以从第一语音信号中提取第一方言的名称对应的音素片段,进而识别出第一方言。Any voice signal with dialect-indicating meaning can serve as the first voice signal in the embodiments of the present application. For example, the first voice signal may be a voice signal uttered by the user in the first dialect, so that the first dialect can be identified based on the acoustic features of the first voice signal. Alternatively, the first voice signal may be a voice signal containing the name of the first dialect; for example, in the voice signal "请启用闽南话模型", "闽南话" (Minnan dialect) is the name of the first dialect. Based on this, the phoneme segment corresponding to the name of the first dialect can be extracted from the first voice signal, and the first dialect thereby identified.
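The second option above, recognizing a dialect name contained in the first voice signal, can be sketched as a lookup over the recognized text of that signal. The dialect-name table and its entries are illustrative assumptions; a real system would match at the phoneme level as the text describes.

```python
# Hypothetical dialect-name lookup over the recognized text of the first
# voice signal, e.g. "请启用闽南话模型" names the Minnan dialect.

DIALECT_NAMES = {
    "闽南话": "minnan",
    "天津话": "tianjin",
    "河南话": "henan",
}

def parse_dialect(text):
    """Return the dialect named in the utterance text, or None if no
    known dialect name occurs in it."""
    for name, dialect in DIALECT_NAMES.items():
        if name in text:
            return dialect
    return None
```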
上述结合语音唤醒词和第一语音信号的语音识别方法可由终端设备和服务器相互配合实施,也可以由终端设备或服务器独立实施。下面将针对不同实施方式分别进行说明:The above speech recognition method combining the voice wake-up word and the first voice signal may be implemented by the terminal device and the server in cooperation, or independently by the terminal device or the server. The different implementations are described separately below:
方式A:上述结合语音唤醒词和第一语音信号的语音识别方法由终端设备和服务器相互配合实施。在方式A中,终端设备支持语音唤醒功能,当用户想要进行语音识别时,可以向终端设备输入语音唤醒词,以唤醒语音识别功能。终端设备接收语音唤醒词,以唤醒语音识别功能。然后,用户向终端设备输入具有方言指示意义的第一语音信号;终端设备接收用户输入的第一语音信号后,从第一语音信号中解析出需要进行语音识别的第一方言,即后续待识别语音信号所属的方言,从而为采用相应方言对应的ASR模型进行语音识别提供基础。Mode A: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server in cooperation. In mode A, the terminal device supports the voice wake-up function; when the user wants to perform speech recognition, a voice wake-up word can be input to the terminal device to wake up the speech recognition function. The terminal device receives the voice wake-up word and wakes up the speech recognition function. The user then inputs to the terminal device a first voice signal with dialect-indicating meaning; after receiving it, the terminal device parses from the first voice signal the first dialect in which speech recognition is required, i.e., the dialect of the subsequent speech signal to be recognized, thereby providing a basis for speech recognition with the ASR model of the corresponding dialect.
终端设备在从第一语音信号中解析出第一方言后,向服务器发送服务请求,该服务请求指示服务器从不同方言对应的ASR模型中选择第一方言对应的ASR模型。服务器接收终端设备发送的服务请求之后,根据该服务请求的指示从不同方言对应的ASR模型中,选择第一方言对应的ASR模型,以便基于第一方言对应的ASR模型对后续待识别语音信号进行语音识别。After parsing the first dialect from the first voice signal, the terminal device sends a service request to the server, where the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. After receiving the service request sent by the terminal device, the server selects the ASR model corresponding to the first dialect from the ASR model corresponding to the different dialects according to the indication of the service request, so as to perform the subsequent to-be-identified voice signal based on the ASR model corresponding to the first dialect. Speech Recognition.
终端设备在向服务器发送服务请求后,继续向服务器发送待识别语音信号,该待识别语音信号属于第一方言。服务器接收终端设备发送的待识别语音信号,并根据选择的第一方言对应的ASR模型对待识别语音信号进行语音识别。对待识别语音信号而言,采用与之匹配的ASR模型进行语音识别,有利于提高语音识别的准确性。After sending the service request to the server, the terminal device continues to send the speech signal to be recognized, which belongs to the first dialect. The server receives it and performs speech recognition on it according to the selected ASR model corresponding to the first dialect. For the speech signal to be recognized, performing recognition with a matching ASR model helps improve the accuracy of speech recognition.
Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; in that case, the terminal device may further receive the user's to-be-recognized voice signal before sending it to the server. Alternatively, the to-be-recognized voice signal may be a voice signal that was pre-recorded and stored locally on the terminal device.
In some exemplary embodiments, the voice wake-up word is mainly used to wake up the speech recognition function of the terminal device, while the first dialect in which subsequent speech recognition is required is provided by the first voice signal. On this basis, the language in which the user utters the voice wake-up word need not be restricted. For example, the user may utter the voice wake-up word in Mandarin, in the first dialect, or in a dialect other than the first dialect.
However, the same user may well use the same language when issuing voice signals to the terminal device; that is, the user may input the voice wake-up word and the first voice signal in the same dialect. For such application scenarios, after receiving the first voice signal input by the user, the terminal device may preferentially parse the first dialect from the first voice signal; if the first dialect cannot be parsed from the first voice signal, the terminal device may identify the dialect to which the voice wake-up word belongs and take it as the first dialect. The implementation for identifying the dialect to which the voice wake-up word belongs is the same as in the foregoing embodiments and is not repeated here.
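The preferential-parse-then-fallback logic just described can be sketched as follows (a minimal illustration; `parse_dialect_from_speech` and `identify_wake_word_dialect` are hypothetical helpers injected by the caller):

```python
def determine_first_dialect(first_signal, wake_word,
                            parse_dialect_from_speech,
                            identify_wake_word_dialect):
    """Prefer the dialect named in the first voice signal; if parsing
    fails, fall back to the dialect in which the wake-up word was spoken."""
    dialect = parse_dialect_from_speech(first_signal)
    if dialect is None:  # parsing failed
        dialect = identify_wake_word_dialect(wake_word)
    return dialect
```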
Mode B: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server in cooperation. In Mode B, the terminal device is mainly configured to receive the voice wake-up word and the first voice signal input by the user and report them to the server, so that the server parses the first dialect from the first voice signal; in this respect it differs from the terminal device in Mode A. Correspondingly, in addition to providing ASR models for different dialects and selecting the corresponding ASR model to recognize voice signals in the corresponding dialect, the server also has the function of parsing the first dialect from the first voice signal.
In Mode B, when the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device. The terminal device receives the voice wake-up word input by the user and sends it to the server. The server wakes up its own speech recognition function based on the voice wake-up word. After inputting the voice wake-up word, the user may continue by inputting the first voice signal to the terminal device, which sends the received first voice signal to the server. The server parses the first dialect from the first voice signal and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so that voice signals in the first dialect can subsequently be recognized based on that model.
After sending the first voice signal to the server, the terminal device continues by sending the to-be-recognized voice signal to the server. Having selected the ASR model corresponding to the first dialect, the server uses that model to perform speech recognition on the to-be-recognized voice signal. Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; in that case, the terminal device may further receive the user's to-be-recognized voice signal before sending it to the server. Alternatively, the to-be-recognized voice signal may be a voice signal that was pre-recorded and stored locally on the terminal device.
In some exemplary embodiments, before the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs and taking it as the first dialect.
In some exemplary embodiments, when the server parses from the first voice signal the first dialect in which speech recognition is required, the parsing includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Mode C: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device or the server alone. In Mode C, when the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device or the server, which wakes up the speech recognition function according to the voice wake-up word input by the user. After inputting the voice wake-up word, the user may continue by inputting to the terminal device or the server a first voice signal that indicates the dialect. The terminal device or the server parses the first dialect from the first voice signal and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
After selecting the ASR model corresponding to the first dialect, the terminal device or the server uses that model to perform speech recognition on the to-be-recognized voice signal. Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device or the server after inputting the first voice signal; in that case, the terminal device or the server may further receive the user's to-be-recognized voice signal before performing speech recognition on it with the ASR model corresponding to the first dialect. Alternatively, the to-be-recognized voice signal may be a voice signal that was pre-recorded and stored locally on the terminal device or the server, in which case the terminal device or the server may obtain it directly from local storage.
In some exemplary embodiments, before the terminal device or the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs and taking it as the first dialect.
In some exemplary embodiments, when the terminal device or the server parses from the first voice signal the first dialect in which speech recognition is required, the parsing includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Optionally, in Modes A, B, and C above, parsing from the first voice signal the first dialect in which speech recognition is required includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Before the first voice signal is converted into the first phoneme sequence based on the acoustic model, the first voice signal needs to be preprocessed and its features extracted. Preprocessing includes pre-emphasis, windowing and framing, and endpoint detection. Feature extraction then extracts acoustic features, such as time-domain or frequency-domain features, from the preprocessed first voice signal.
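The first two preprocessing steps can be sketched as follows (a minimal pure-Python illustration under typical assumptions — a pre-emphasis coefficient of 0.97 and fixed-length overlapping frames; endpoint detection and the actual feature computation are omitted):

```python
def pre_emphasis(signal, alpha=0.97):
    """Pre-emphasis: boost high frequencies with y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame(signal, frame_len, hop):
    """Windowing/framing: split the signal into overlapping frames of
    frame_len samples, advancing by hop samples each time."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]
```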
The acoustic model converts the acoustic features of the first voice signal into a phoneme sequence. Phonemes are the basic units that make up the pronunciation of a word or a Chinese character. The phonemes constituting the pronunciation of an English word may be the 39 phonemes defined by Carnegie Mellon University; the phonemes constituting the pronunciation of Chinese characters may be the full set of initials and finals. Acoustic models include, but are not limited to, neural-network-based deep learning models and hidden Markov models. Converting acoustic features into a phoneme sequence belongs to the prior art and is not described further here.
After converting the first voice signal into the first phoneme sequence, the terminal device or the server matches the phoneme segments corresponding to different dialect names against the first phoneme sequence. Phoneme segments for different dialect names may be stored in advance, for example the phoneme segment of the dialect name "河南话" (Henan dialect), that of the dialect name "闽南语" (Minnan), that of the dialect name "British English", and so on. If a dialect name is an English word, its phoneme segment is a segment composed of phonemes drawn from the 39 phonemes defined by Carnegie Mellon University; if a dialect name is written in Chinese characters, its phoneme segment is a segment composed of the initials and finals of the name. The first phoneme sequence is compared with the pre-stored phoneme segments corresponding to the different dialect names to determine whether the first phoneme sequence contains a phoneme segment that is identical or similar to the phoneme segment of some dialect name. Optionally, the similarity between each phoneme segment in the first phoneme sequence and the phoneme segments of the different dialect names may be computed; among the phoneme segments of the different dialect names, the segment whose similarity to some phoneme segment in the first phoneme sequence meets a preset similarity requirement is selected as the matched phoneme segment. The dialect corresponding to the matched phoneme segment is then taken as the first dialect.
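The matching step can be sketched as follows. This is a minimal illustration under simplifying assumptions: `DIALECT_PHONEMES` holds made-up phoneme segments for two dialect names, and similarity is just the fraction of positions at which two equal-length segments agree; a real system would use stored pronunciations and a proper similarity measure.

```python
DIALECT_PHONEMES = {  # hypothetical pre-stored phoneme segments per dialect name
    "Henan dialect": ["h", "e", "n", "an", "h", "ua"],
    "Minnan": ["m", "in", "n", "an", "y", "u"],
}


def segment_similarity(a, b):
    """Fraction of positions at which two segments agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def match_dialect(phoneme_seq, threshold=0.8):
    """Slide each dialect-name phoneme segment over the first phoneme
    sequence; return the dialect whose segment meets the preset
    similarity requirement, or None if no dialect name is found."""
    for dialect, segment in DIALECT_PHONEMES.items():
        n = len(segment)
        for i in range(len(phoneme_seq) - n + 1):
            if segment_similarity(phoneme_seq[i:i + n], segment) >= threshold:
                return dialect
    return None
```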
It should be noted that some steps or content in the foregoing Modes A, B, and C are the same as or similar to steps or content in the embodiments shown in FIGS. 1-7; for such content, reference may be made to the descriptions of the embodiments shown in FIGS. 1-7, which are not repeated here.
In addition, some of the flows described in the foregoing embodiments and drawings include multiple operations that appear in a specific order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or in parallel. Operation numbers such as 201 and 202 are merely used to distinguish different operations; the numbers themselves do not represent any execution order. Moreover, these flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they neither represent an order nor require that the "first" and "second" items be of different types.
FIG. 8 is a schematic structural diagram of the modules of a speech recognition apparatus provided by still another exemplary embodiment of the present application. As shown in FIG. 8, the speech recognition apparatus 800 includes a receiving module 801, an identification module 802, a first sending module 803, and a second sending module 804.
The receiving module 801 is configured to receive a voice wake-up word.
The identification module 802 is configured to identify the first dialect to which the voice wake-up word received by the receiving module 801 belongs.
The first sending module 803 is configured to send a service request to the server to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
The second sending module 804 is configured to send the to-be-recognized voice signal to the server, so that the server performs speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.
In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the identification module 802 is specifically configured to: dynamically match the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets a first setting requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose degree of match with the acoustic features of the voice wake-up word meets a second setting requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets a third setting requirement.
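The first option — dynamic matching of acoustic features against per-dialect reference wake-up words — is the kind of comparison that dynamic time warping (DTW) performs. The sketch below is a minimal illustration of that idea, not the patent's implementation: features are simplified to 1-D number sequences, and the reference recordings in `best_dialect` are supplied by the caller.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences;
    a lower distance means a better match against the reference."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]


def best_dialect(wake_features, references):
    """Pick the dialect whose recorded reference wake-up word is closest
    to the input wake-up word under DTW."""
    return min(references,
               key=lambda dia: dtw_distance(wake_features, references[dia]))
```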
In an optional implementation, when receiving the voice wake-up word, the receiving module 801 is specifically configured to: in response to an instruction to activate or power on the terminal device, present a voice input interface to the user; and obtain the voice wake-up word input by the user via the voice input interface.
In an optional implementation, before sending the to-be-recognized voice signal to the server, the second sending module 804 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
In an optional implementation, before outputting the voice input prompt information, the second sending module 804 is further configured to: receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional implementation, before receiving the voice wake-up word, the receiving module 801 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user; and save the custom voice signal as the voice wake-up word. The internal functions and structure of the speech recognition apparatus 800 are described above. As shown in FIG. 9, in practice, the speech recognition apparatus 800 may be implemented as a terminal device including a memory 901, a processor 902, and a communication component 903.
The memory 901 is configured to store a computer program and may also be configured to store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
The memory 901 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 902 is coupled to the memory 901 and is configured to execute the computer program in the memory 901 to: receive a voice wake-up word through the communication component 903; identify the first dialect to which the voice wake-up word belongs; send a service request to the server through the communication component 903 to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server through the communication component 903, so that the server performs speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.
The communication component 903 is configured to receive the voice wake-up word, and to send the service request and the to-be-recognized voice signal to the server.
In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the processor 902 is specifically configured to:
dynamically match the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets a first setting requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose degree of match with the acoustic features of the voice wake-up word meets a second setting requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets a third setting requirement.
In an optional implementation, as shown in FIG. 9, the terminal device further includes a display screen 904. On this basis, when receiving the voice wake-up word, the processor 902 is specifically configured to: in response to an instruction to activate or power on the terminal device, present a voice input interface to the user through the display screen 904; and obtain the voice wake-up word input by the user via the voice input interface.
In an optional implementation, the terminal device further includes an audio component 906. On this basis, before sending the to-be-recognized voice signal to the server, the processor 902 is further configured to: output voice input prompt information through the audio component 906 to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user through the audio component 906. Correspondingly, the audio component 906 is further configured to output the voice input prompt information and to receive the to-be-recognized voice signal input by the user.
In an optional implementation, before outputting the voice input prompt information, the processor 902 is further configured to: receive, through the communication component 903, a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional implementation, before receiving the voice wake-up word, the processor 902 is further configured to: in response to a wake-up word customization operation, receive, through the communication component 903, a custom voice signal input by the user; and save the custom voice signal as the voice wake-up word.
Further, as shown in FIG. 9, the terminal device also includes other components such as a power component 905.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed, the steps that can be performed by the terminal device in the foregoing method embodiments can be implemented.
FIG. 10 is a schematic structural diagram of the modules of another speech recognition apparatus provided by still another exemplary embodiment of the present application. As shown in FIG. 10, the speech recognition apparatus 1000 includes a first receiving module 1001, a selection module 1002, a second receiving module 1003, and an identification module 1004.
The first receiving module 1001 is configured to receive a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect.
The selection module 1002 is configured to select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs.
The second receiving module 1003 is configured to receive the to-be-recognized voice signal sent by the terminal device.
The identification module 1004 is configured to perform speech recognition on the to-be-recognized voice signal received by the second receiving module 1003 using the ASR model corresponding to the first dialect.
In an optional implementation, the speech recognition apparatus 1000 further includes a construction module configured to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of the different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to the acoustic features of the different dialects.
The internal functions and structure of the speech recognition apparatus 1000 are described above. As shown in FIG. 11, in practice, the speech recognition apparatus 1000 may be implemented as a server including a memory 1101, a processor 1102, and a communication component 1103.
The memory 1101 is configured to store a computer program and may also be configured to store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
The memory 1101 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 1102 is coupled to the memory 1101 and is configured to execute the computer program in the memory 1101 to: receive, through the communication component 1103, a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and receive, through the communication component 1103, the to-be-recognized voice signal sent by the terminal device and perform speech recognition on it using the ASR model corresponding to the first dialect.
The communication component 1103 is configured to receive the service request and the to-be-recognized voice signal.
在一可选实施方式中,处理器1102在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,还用于:收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, the processor 1102 is configured to: collect corpora of different dialects before extracting the ASR model corresponding to the first dialect in the ASR model corresponding to different dialects; perform feature extraction on corpora of different dialects, In order to obtain the acoustic characteristics of different dialects; according to the acoustic characteristics of different dialects, construct ASR models corresponding to different dialects.
进一步,如图11所示,该服务器还包括:音频组件1106。基于此,处理器1102还用于:通过音频组件1106接收终端设备发送的待识别语音信号。Further, as shown in FIG. 11, the server further includes an audio component 1106. Based on this, the processor 1102 is further configured to: receive, by the audio component 1106, the to-be-identified voice signal sent by the terminal device.
可选地,如图11所示,该服务器还包括显示屏1104、电源组件1105等其它组件。Optionally, as shown in FIG. 11, the server further includes a display 1104, a power component 1105, and the like.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由服务器执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高对多方言语音进行识别的效率。Further, since the voice wake-up word is relatively short, the process of identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
图12为本申请又一示例性实施例提供的又一种语音识别装置的模块结构示意图。如图12所示,语音识别装置1200包括接收模块1201、第一发送模块1202和第二发送模块1203。FIG. 12 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application. As shown in FIG. 12, the voice recognition apparatus 1200 includes a receiving module 1201, a first transmitting module 1202, and a second transmitting module 1203.
接收模块1201,用于接收语音唤醒词。The receiving module 1201 is configured to receive a voice wake-up word.
第一发送模块1202,用于向服务器发送接收模块1201接收的语音唤醒词,以供服务器基于语音唤醒词从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型。The first sending module 1202 is configured to send, to the server, the voice wake-up words received by the receiving module 1201, so that the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects based on the voice wake-up words.
第二发送模块1203,用于向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The second sending module 1203 is configured to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
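The two-step exchange performed by modules 1201-1203 (first send the wake-up word so the server can pick the dialect-specific ASR model, then send the speech to be recognized) can be sketched with a local stub server. All class and method names below are illustrative assumptions, not the patent's actual interfaces.

```python
# Illustrative sketch of the terminal/server exchange. The server is a local
# stub: handle_wake_word() plays the role of dialect identification plus ASR
# model selection, handle_speech() plays the role of recognition.
class StubServer:
    def __init__(self, models):
        self.models = models          # dialect -> recognizer callable
        self.selected = None

    def handle_wake_word(self, wake_word):
        # Pretend to identify the dialect from the wake-up word (assumed words).
        dialect = "cantonese" if wake_word == "lei-hou" else "mandarin"
        self.selected = self.models[dialect]
        return dialect                # doubles as the "model selected" notice

    def handle_speech(self, speech):
        assert self.selected is not None, "wake-up word must be sent first"
        return self.selected(speech)

class Terminal:
    def __init__(self, server):
        self.server = server

    def recognize(self, wake_word, speech):
        self.server.handle_wake_word(wake_word)   # first sending module 1202
        return self.server.handle_speech(speech)  # second sending module 1203
```

The key design point mirrored here is the ordering constraint: the recognition request is only valid after the wake-up word has caused a model to be selected.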
在一可选实施方式中,接收模块1201在接收语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, when receiving the voice wake-up word, the receiving module 1201 is specifically configured to: present a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,第二发送模块1203在向服务器发送待识别语音信号之前,还用于:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In an optional implementation, before sending the to-be-recognized voice signal to the server, the second sending module 1203 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
在一可选实施方式中,第二发送模块1203在输出语音输入提示信息之前,还用于:接收服务器返回的通知消息,通知消息用于指示已选择第一方言对应的ASR模型。In an optional implementation manner, before outputting the voice input prompt information, the second sending module 1203 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
在一可选实施方式中,接收模块1201在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,接收用户输入的自定义语音信号。第一发送模块1202还用于将自定义语音信号上传至服务器。In an optional implementation, before receiving the voice wake-up word, the receiving module 1201 is further configured to: receive a customized voice signal input by the user in response to the wake-up word customization operation. The first sending module 1202 is further configured to upload a customized voice signal to the server.
以上描述了语音识别装置1200的内部功能和结构,如图13所示,实际中,该语音识别装置1200可实现为一种终端设备,包括:存储器1301、处理器1302以及通信组件1303。The internal function and structure of the speech recognition apparatus 1200 are described above. As shown in FIG. 13, in actuality, the speech recognition apparatus 1200 can be implemented as a terminal device, including: a memory 1301, a processor 1302, and a communication component 1303.
存储器1301,用于存储计算机程序,并可被存储为存储其它各种数据以支持在终端设备上的操作。这些数据的示例包括用于在终端设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 1301 is configured to store a computer program, and may further store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
存储器1301可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 1301 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器1302,与存储器1301耦合,用于执行存储器1301中的计算机程序,以用于:通过通信组件1303接收语音唤醒词;通过通信组件1303向服务器发送语音唤醒词,以供服务器基于语音唤醒词从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型;通过通信组件1303向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor 1302 is coupled to the memory 1301 and is configured to execute the computer program in the memory 1301 so as to: receive a voice wake-up word via the communication component 1303; send the voice wake-up word to the server via the communication component 1303, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server via the communication component 1303, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
通信组件1303,用于接收所述语音唤醒词,向所述服务器发送所述语音唤醒词和所述待识别语音信号。The communication component 1303 is configured to receive the voice wake-up word, and to send the voice wake-up word and the to-be-recognized voice signal to the server.
在一可选实施方式中,如图13所示,该终端设备还包括显示屏1304。基于此,处理器1302在接收语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,通过显示屏1304向用户展示语音输入界面;并基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, as shown in FIG. 13, the terminal device further includes a display screen 1304. On this basis, when receiving the voice wake-up word, the processor 1302 is specifically configured to: present a voice input interface to the user through the display screen 1304 in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,如图13所示,该终端设备还包括音频组件1306。基于此,处理器1302用于:通过音频组件1306接收语音唤醒词。相应地,处理器1302在向服务器发送待识别语音信号之前,还用于:通过音频组件1306输出语音输入提示信息,以提示用户进行语音输入;以及接收用户输入的待识别语音信号。In an alternative embodiment, as shown in FIG. 13, the terminal device further includes an audio component 1306. Based on this, the processor 1302 is configured to receive the speech wake-up words through the audio component 1306. Correspondingly, before sending the to-be-identified voice signal to the server, the processor 1302 is further configured to: output the voice input prompt information through the audio component 1306 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
在一可选实施方式中,处理器1302在输出语音输入提示信息之前,还用于:接收服务器返回的通知消息,通知消息用于指示已选择第一方言对应的ASR模型。In an optional implementation, before outputting the voice input prompt information, the processor 1302 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
在一可选实施方式中,处理器1302在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,通过通信组件1303接收用户输入的自定义语音信号,并将自定义语音信号上传至服务器。In an optional implementation, before receiving the voice wake-up word, the processor 1302 is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user via the communication component 1303, and upload the customized voice signal to the server.
进一步,如图13所示,该终端设备还包括:电源组件1305等其它组件。Further, as shown in FIG. 13, the terminal device further includes: a power component 1305 and other components.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由终端设备执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the terminal device in the foregoing method embodiment can be implemented.
图14为本申请又一示例性实施例提供的又一种语音识别装置的模块结构示意图。如图14所示,语音识别装置1400包括第一接收模块1401、第一识别模块1402、选择模块 1403、第二接收模块1404、第二识别模块1405。FIG. 14 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application. As shown in FIG. 14, the voice recognition apparatus 1400 includes a first receiving module 1401, a first identifying module 1402, a selecting module 1403, a second receiving module 1404, and a second identifying module 1405.
第一接收模块1401,用于接收终端设备发送的语音唤醒词。The first receiving module 1401 is configured to receive a voice wake-up word sent by the terminal device.
第一识别模块1402,用于识别语音唤醒词所属的第一方言。The first identification module 1402 is configured to identify a first dialect to which the voice wake-up word belongs.
选择模块1403,用于从不同方言对应的ASR模型中,选择第一方言对应的ASR模型。The selecting module 1403 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
第二接收模块1404,用于接收终端设备发送的待识别语音信号。The second receiving module 1404 is configured to receive a to-be-identified voice signal sent by the terminal device.
第二识别模块1405,用于利用第一方言对应的ASR模型对第二接收模块1404接收的待识别语音信号进行语音识别。The second identification module 1405 is configured to perform voice recognition on the to-be-identified voice signal received by the second receiving module 1404 by using the ASR model corresponding to the first dialect.
在一可选实施方式中,第一识别模块1402在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the first identification module 1402 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
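A toy version of the first matching strategy above, dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in each dialect, can be sketched with classic dynamic time warping (DTW). Feature sequences are plain lists of floats here; a real system would compare frame-level acoustic features, and the "set requirement" threshold logic is reduced to simply picking the closest reference.

```python
# Dynamic matching of a wake-up word against per-dialect reference wake-up
# words via DTW. Feature values and dialect names are illustrative only.
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def identify_dialect(wake_features, references):
    """references: dict dialect -> reference wake-word feature sequence.
    Returns the dialect whose reference matches the wake-up word best."""
    return min(references, key=lambda dia: dtw_distance(wake_features, references[dia]))
```

DTW tolerates the speaking-rate differences that make naive frame-by-frame comparison fragile, which is why "dynamic matching" is a natural fit for short wake-up words.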
在一可选实施方式中,语音识别装置1400还包括构建模块,用于在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, the speech recognition apparatus 1400 further includes a building module, configured to collect corpora of different dialects before selecting an ASR model corresponding to the first dialect in the ASR model corresponding to different dialects; Feature extraction is performed to obtain acoustic features of different dialects; according to the acoustic characteristics of different dialects, ASR models corresponding to different dialects are constructed.
以上描述了语音识别装置1400的内部功能和结构,如图15所示,实际中,该语音识别装置1400可实现为一种服务器,包括:存储器1501、处理器1502以及通信组件1503。The internal function and structure of the speech recognition apparatus 1400 are described above. As shown in FIG. 15, in actuality, the speech recognition apparatus 1400 can be implemented as a server including: a memory 1501, a processor 1502, and a communication component 1503.
存储器1501,用于存储计算机程序,并可被存储为存储其它各种数据以支持在服务器上的操作。这些数据的示例包括用于在服务器上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 1501 is for storing a computer program and can be stored to store other various data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
存储器1501可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 1501 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器1502,与存储器1501耦合,用于执行存储器1501中的计算机程序,以用于:通过通信组件1503接收终端设备发送的语音唤醒词;识别语音唤醒词所属的第一方言;从不同方言对应的ASR模型中,选择第一方言对应的ASR模型;通过通信组件1503接收终端设备发送的待识别语音信号,并利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor 1502 is coupled to the memory 1501 and is configured to execute the computer program in the memory 1501 so as to: receive, via the communication component 1503, the voice wake-up word sent by the terminal device; identify the first dialect to which the voice wake-up word belongs; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and receive, via the communication component 1503, the to-be-recognized voice signal sent by the terminal device and perform speech recognition on it using the ASR model corresponding to the first dialect.
通信组件1503,用于接收语音唤醒词以及待识别语音信号。The communication component 1503 is configured to receive a voice wake-up word and a voice signal to be recognized.
在一可选实施方式中,处理器1502在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the processor 1502 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
在一可选实施方式中,处理器1502在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,还用于收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1502 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and build the ASR models corresponding to the different dialects from those acoustic features.
进一步,如图15所示,该服务器还包括:音频组件1506。基于此,处理器1502用于:通过音频组件1506接收终端设备发送的语音唤醒词,并通过音频组件1506接收所述终端设备发送的待识别语音信号。Further, as shown in FIG. 15, the server further includes an audio component 1506. Based on this, the processor 1502 is configured to: receive, by the audio component 1506, a voice wake-up word sent by the terminal device, and receive, by the audio component 1506, the voice signal to be recognized sent by the terminal device.
进一步,如图15所示,该服务器还包括:显示屏1504、电源组件1505等其它组件。Further, as shown in FIG. 15, the server further includes: a display 1504, a power component 1505, and the like.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由服务器执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高多方言语音识别的效率。Further, since the voice wake-up word is relatively short, the process of identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
图16为本申请又一示例性实施例提供的又一种语音识别装置的模块结构示意图。如图16所示,语音识别装置1600包括接收模块1601、第一识别模块1602、选择模块1603、 第二识别模块1604。FIG. 16 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application. As shown in FIG. 16, the voice recognition apparatus 1600 includes a receiving module 1601, a first identifying module 1602, a selecting module 1603, and a second identifying module 1604.
接收模块1601,用于接收语音唤醒词。The receiving module 1601 is configured to receive a voice wake-up word.
第一识别模块1602,用于识别语音唤醒词所属的第一方言。The first identification module 1602 is configured to identify a first dialect to which the voice wake-up word belongs.
选择模块1603,用于从不同方言对应的ASR模型中选择第一方言对应的ASR模型。The selecting module 1603 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
第二识别模块1604,用于利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The second identification module 1604 is configured to perform speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect.
在一可选实施方式中,第一识别模块1602在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the first identification module 1602 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
在一可选实施方式中,接收模块1601在接收终端设备发送的语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, when receiving the voice wake-up word sent by the terminal device, the receiving module 1601 is specifically configured to: present a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,第二识别模块1604在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,还用于:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In an optional implementation, before performing speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect, the second identification module 1604 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
在一可选实施方式中,语音识别装置1600还包括构建模块,用于在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, the voice recognition apparatus 1600 further includes a building module, configured to collect corpora of different dialects before selecting an ASR model corresponding to the first dialect in an ASR model corresponding to different dialects; and corpus for different dialects Feature extraction is performed to obtain acoustic features of different dialects; according to the acoustic characteristics of different dialects, ASR models corresponding to different dialects are constructed.
在一可选实施方式中,接收模块1601在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,接收用户输入的自定义语音信号;将自定义语音信号保存为语音唤醒词。In an optional implementation, before receiving the voice wake-up word, the receiving module 1601 is further configured to: receive a custom voice signal input by the user in response to the wake-up word custom operation; and save the customized voice signal as a voice wake-up word.
以上描述了语音识别装置1600的内部功能和结构,如图17所示,实际中,该语音识别装置1600可实现为一种电子设备,包括:存储器1701、处理器1702以及通信组件1703。该电子设备可以是终端设备,也可以是服务器。The internal function and structure of the speech recognition apparatus 1600 are described above. As shown in FIG. 17, in practice, the speech recognition apparatus 1600 can be implemented as an electronic device including: a memory 1701, a processor 1702, and a communication component 1703. The electronic device can be a terminal device or a server.
存储器1701,用于存储计算机程序,并可被存储为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 1701 is configured to store a computer program, and may further store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
存储器1701可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 1701 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器1702,与存储器1701耦合,用于执行存储器1701中的计算机程序,以用于:通过通信组件1703接收语音唤醒词;识别语音唤醒词所属的第一方言;从不同方言对应的ASR模型中选择第一方言对应的ASR模型;利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor 1702 is coupled to the memory 1701 for executing a computer program in the memory 1701 for: receiving a speech wake-up word through the communication component 1703; identifying a first dialect to which the speech wake-up word belongs; from an ASR model corresponding to different dialects The ASR model corresponding to the first dialect is selected; the ASR model corresponding to the first dialect is used to perform speech recognition on the speech signal to be recognized.
通信组件1703,用于接收语音唤醒词。The communication component 1703 is configured to receive a voice wake-up word.
在一可选实施方式中,处理器1702在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the processor 1702 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
在一可选实施方式中,如图17所示,该电子设备还包括:显示屏1704。基于此,处理器1702在接收终端设备发送的语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,通过显示屏1704向用户展示语音输入界面;并基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, as shown in FIG. 17, the electronic device further includes a display screen 1704. On this basis, when receiving the voice wake-up word sent by the terminal device, the processor 1702 is specifically configured to: present a voice input interface to the user through the display screen 1704 in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,如图17所示,该电子设备还包括:音频组件1706。基于此,处理器1702在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,还用于:通过音频组件1706输出语音输入提示信息,以提示用户进行语音输入;并接收用户输入的待识别语音信号。相应地,处理器1702还用于:通过音频组件1706接收语音唤醒词。In an optional implementation, as shown in FIG. 17, the electronic device further includes an audio component 1706. On this basis, before performing speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect, the processor 1702 is further configured to: output voice input prompt information through the audio component 1706 to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user. Correspondingly, the processor 1702 is further configured to receive the voice wake-up word through the audio component 1706.
在一可选实施方式中,处理器1702在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,还用于收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1702 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and build the ASR models corresponding to the different dialects from those acoustic features.
在一可选实施方式中,处理器1702在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,通过通信组件1703接收用户输入的自定义语音信号;将自定义语音信号保存为语音唤醒词。进一步,如图17所示,该电子设备还包括:电源组件1705等其它组件。In an optional implementation, before receiving the voice wake-up word, the processor 1702 is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user via the communication component 1703; and save the customized voice signal as the voice wake-up word. Further, as shown in FIG. 17, the electronic device further includes other components such as a power component 1705.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由电子设备执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the electronic device in the foregoing method embodiment can be implemented.
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高对多方言语音进行识别的效率。Further, since the voice wake-up word is relatively short, the process of identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
本申请实施例还提供一种终端设备,包括:存储器、处理器和通信组件。The embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component.
存储器,用于存储计算机程序,并可被存储为存储其它各种数据以支持在终端设备上的操作。这些数据的示例包括用于在终端设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。A memory for storing a computer program and can be stored to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器，与存储器和通信组件耦合，用于执行存储器中的计算机程序，以用于：通过通信组件接收语音唤醒词，以唤醒语音识别功能；通过通信组件接收用户输入的具有方言指示意义的第一语音信号；从第一语音信号中解析出需要进行语音识别的第一方言；从不同方言对应的ASR模型中选择第一方言对应的ASR模型；通过通信组件向服务器发送服务请求，以请求服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型；通过通信组件向服务器发送待识别语音信号，以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor, coupled to the memory and the communication component, is configured to execute the computer program in the memory to: receive a voice wake-up word through the communication component to wake up the voice recognition function; receive, through the communication component, a first voice signal input by the user that indicates a dialect; parse, from the first voice signal, a first dialect in which voice recognition is to be performed; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; send a service request to the server through the communication component to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the voice signal to be recognized to the server through the communication component, so that the server performs voice recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
所述通信组件,用于接收语音唤醒词和所述第一语音信号,以及向所述服务器发送服务请求和待识别语音信号。The communication component is configured to receive a voice wake-up word and the first voice signal, and send a service request and a voice signal to be recognized to the server.
在一可选实施方式中,处理器在向服务器发送服务请求之前,还用于:若未能从第一语音信号中解析出第一方言,识别语音唤醒词所属的方言作为第一方言。In an optional implementation, before sending the service request to the server, the processor is further configured to: if the first dialect is not parsed from the first voice signal, identify a dialect to which the voice wake-up word belongs as the first dialect.
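A minimal sketch of this optional fallback, with the function and variable names assumed for illustration (the disclosure itself only specifies the behavior: use the wake-up word's dialect when parsing the first voice signal fails):

```python
# Hypothetical sketch: prefer the dialect parsed from the user's first
# voice signal; when parsing fails (represented here as None), fall back
# to the dialect recognized from the wake-up word itself.
def resolve_first_dialect(parsed_dialect, wake_word_dialect):
    """Return the dialect to use for ASR model selection."""
    return parsed_dialect if parsed_dialect is not None else wake_word_dialect
```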
在一可选实施方式中，存储器还用于存储不同方言名称对应的音素片段。相应地，处理器在从第一语音信号中解析出需要进行语音识别的第一方言时，具体用于：基于声学模型将所述第一语音信号转换为第一音素序列；将存储器中存储的不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配；当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。In an optional implementation, the memory is further configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing, from the first voice signal, the first dialect in which voice recognition is to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names, stored in the memory, against the first phoneme sequence, respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
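The phoneme-segment matching step can be sketched as follows. The phoneme symbols, the dialect-name fragments, and the contiguous-subsequence matching rule are all illustrative assumptions; the disclosure only requires that stored dialect-name fragments be matched within the first phoneme sequence.

```python
# Hypothetical phoneme fragments for dialect names, e.g. as produced by an
# acoustic model; the symbols below are illustrative, not a real phone set.
DIALECT_NAME_PHONEMES = {
    "sichuanese": ["s", "i4", "ch", "uan", "h", "ua"],   # e.g. "四川话"
    "cantonese":  ["g", "uang", "d", "ong", "h", "ua"],  # e.g. "广东话"
}

def parse_dialect(first_phoneme_sequence):
    """Return the dialect whose name fragment occurs contiguously in the
    utterance's phoneme sequence, or None when no fragment matches."""
    for dialect, fragment in DIALECT_NAME_PHONEMES.items():
        m = len(fragment)
        for i in range(len(first_phoneme_sequence) - m + 1):
            if first_phoneme_sequence[i:i + m] == fragment:
                return dialect
    return None
```

When `parse_dialect` returns `None`, the optional fallback described above (using the wake-up word's own dialect) would apply.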
本申请实施例还提供一种服务器,包括:存储器、处理器和通信组件。The embodiment of the present application further provides a server, including: a memory, a processor, and a communication component.
存储器，用于存储计算机程序，并可被配置为存储其它各种数据以支持在服务器上的操作。这些数据的示例包括用于在服务器上操作的任何应用程序或方法的指令，联系人数据，电话簿数据，消息，图片，视频等。The memory is configured to store a computer program, and may further be configured to store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。The memory can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
处理器，与存储器和通信组件耦合，用于执行存储器中的计算机程序，以用于：通过通信组件接收终端设备发送的语音唤醒词，以唤醒语音识别功能；通过通信组件接收终端设备发送的具有方言指示意义的第一语音信号；从第一语音信号中解析出需要进行语音识别的第一方言；从不同方言对应的ASR模型中选择第一方言对应的ASR模型；通过通信组件接收终端设备发送的待识别语音信号，并利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor, coupled to the memory and the communication component, is configured to execute the computer program in the memory to: receive, through the communication component, a voice wake-up word sent by the terminal device to wake up the voice recognition function; receive, through the communication component, a first voice signal sent by the terminal device that indicates a dialect; parse, from the first voice signal, a first dialect in which voice recognition is to be performed; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and receive, through the communication component, the voice signal to be recognized sent by the terminal device, and perform voice recognition on it using the ASR model corresponding to the first dialect.
通信组件，用于接收语音唤醒词、第一语音信号和待识别语音信号。The communication component is configured to receive the voice wake-up word, the first voice signal, and the voice signal to be recognized.
在一可选实施方式中，处理器在从不同方言对应的ASR模型中选择第一方言对应的ASR模型之前，还用于：若未能从第一语音信号中解析出第一方言，识别语音唤醒词所属的方言作为第一方言。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect fails to be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
在一可选实施方式中，存储器还用于存储不同方言名称对应的音素片段。相应地，处理器在从第一语音信号中解析出需要进行语音识别的第一方言时，具体用于：基于声学模型将所述第一语音信号转换为第一音素序列；将存储器中存储的不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配；当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。In an optional implementation, the memory is further configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing, from the first voice signal, the first dialect in which voice recognition is to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names, stored in the memory, against the first phoneme sequence, respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
本申请实施例还提供一种电子设备,该电子设备可以是终端设备,也可以是服务器。该电子设备包括:存储器、处理器和通信组件。The embodiment of the present application further provides an electronic device, which may be a terminal device or a server. The electronic device includes a memory, a processor, and a communication component.
存储器，用于存储计算机程序，并可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令，联系人数据，电话簿数据，消息，图片，视频等。The memory is configured to store a computer program, and may further be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。The memory can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
处理器，与存储器和通信组件耦合，用于执行存储器中的计算机程序，以用于：通过通信组件接收语音唤醒词，以唤醒语音识别功能；通过通信组件接收用户输入的具有方言指示意义的第一语音信号；从第一语音信号中解析出需要进行语音识别的第一方言；从不同方言对应的ASR模型中选择第一方言对应的ASR模型；利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor, coupled to the memory and the communication component, is configured to execute the computer program in the memory to: receive a voice wake-up word through the communication component to wake up the voice recognition function; receive, through the communication component, a first voice signal input by the user that indicates a dialect; parse, from the first voice signal, a first dialect in which voice recognition is to be performed; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and perform voice recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
通信组件，用于接收语音唤醒词和第一语音信号。The communication component is configured to receive the voice wake-up word and the first voice signal.
在一可选实施方式中，处理器在从不同方言对应的ASR模型中选择第一方言对应的ASR模型之前，还用于：若未能从第一语音信号中解析出第一方言，识别语音唤醒词所属的方言作为第一方言。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect fails to be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
在一可选实施方式中，存储器还用于存储不同方言名称对应的音素片段。相应地，处理器在从第一语音信号中解析出需要进行语音识别的第一方言时，具体用于：基于声学模型将所述第一语音信号转换为第一音素序列；将存储器中存储的不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配；当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。In an optional implementation, the memory is further configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing, from the first voice signal, the first dialect in which voice recognition is to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names, stored in the memory, against the first phoneme sequence, respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
上述图9、图11、图13、图15和图17中的通信组件被配置为便于通信组件所在设备和其他设备之间有线或无线方式的通信。通信组件所在设备可以接入基于通信标准的无线网络，如WiFi，2G或3G，或它们的组合。在一个示例性实施例中，通信组件经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中，通信组件还包括近场通信(NFC)模块，以促进短程通信。例如，NFC模块可基于射频识别(RFID)技术，红外数据协会(IrDA)技术，超宽带(UWB)技术，蓝牙(BT)技术和其他技术来实现。The communication components in Figures 9, 11, 13, 15, and 17 above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
上述图9、图11、图13、图15和图17中的显示屏包括液晶显示器(LCD)和触摸面板(TP)。如果显示屏包括触摸面板,显示屏可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与触摸或滑动操作相关的持续时间和压力。The display screens of FIGS. 9, 11, 13, 15, and 17 described above include a liquid crystal display (LCD) and a touch panel (TP). If the display includes a touch panel, the display can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
上述图9、图11、图13、图15和图17中的电源组件为电源组件所在设备的各种组件提供电力。电源组件可以包括电源管理系统,一个或多个电源,及其他与为电源组件所在设备生成、管理和分配电力相关联的组件。The power components in Figures 9, 11, 13, 15, and 17 above provide power to the various components of the device in which the power components are located. The power components can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
上述图9、图11、图13、图15和图17中的音频组件可被配置为输出和/或输入音频信号。例如，音频组件包括一个麦克风(MIC)，当音频组件所在设备处于操作模式，如呼叫模式、记录模式和语音识别模式时，麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器或经由通信组件发送。在一些实施例中，音频组件还包括一个扬声器，用于输出音频信号。The audio components in Figures 9, 11, 13, 15, and 17 above are configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC), which is configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal can be further stored in the memory or transmitted via the communication component. In some embodiments, the audio component further includes a speaker for outputting audio signals.
本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
本发明是参照根据本发明实施例的方法、设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
在一个典型的配置中，计算设备包括一个或多个处理器（CPU）、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存（PRAM）、静态随机存取存储器（SRAM）、动态随机存取存储器（DRAM）、其他类型的随机存取存储器（RAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、快闪记忆体或其他内存技术、只读光盘只读存储器（CD-ROM）、数字多功能光盘（DVD）或其他光学存储、磁盒式磁带，磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体（transitory media），如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
还需要说明的是，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
以上所述仅为本申请的实施例而已，并不用于限制本申请。对于本领域技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等，均应包含在本申请的权利要求范围之内。The above descriptions are merely embodiments of the present application and are not intended to limit the present application. Those skilled in the art can make various changes and modifications to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (24)

  1. 一种语音识别方法，适用于终端设备，其特征在于，所述方法包括：A voice recognition method, applicable to a terminal device, wherein the method comprises:
    接收语音唤醒词;Receiving a speech wake-up word;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    向服务器发送服务请求,以请求所述服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Sending a service request to the server, requesting the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。Sending a to-be-identified voice signal to the server, so that the server performs voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  2. 根据权利要求1所述的方法,其特征在于,所述识别所述语音唤醒词所属的第一方言,包括:The method according to claim 1, wherein the identifying the first dialect to which the voice wake-up word belongs comprises:
    将所述语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配，获取与所述语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为所述第一方言；或者dynamically matching acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking, as the first dialect, the dialect corresponding to a reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or
    将所述语音唤醒词的声学特征分别与不同方言的声学特征进行匹配，获取与所述语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为所述第一方言；或者matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking, as the first dialect, a dialect whose degree of matching with the acoustic features of the voice wake-up word meets a second set requirement; or
    将所述语音唤醒词转换成文本唤醒词，将所述文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配，获取与所述文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为所述第一方言。converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking, as the first dialect, the dialect corresponding to a reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
  3. 根据权利要求1所述的方法,其特征在于,所述接收语音唤醒词,包括:The method according to claim 1, wherein said receiving a speech wake-up word comprises:
    响应于激活或开启所述终端设备的指令,向用户展示语音输入界面;Presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device;
    基于所述语音输入界面获取所述用户输入的语音唤醒词。Acquiring a voice wake-up word input by the user based on the voice input interface.
  4. 根据权利要求1-3任一项所述的方法,其特征在于,在向所述服务器发送待识别语音信号之前,所述方法还包括:The method according to any one of claims 1-3, wherein before the sending the to-be-identified voice signal to the server, the method further comprises:
    输出语音输入提示信息,以提示用户进行语音输入;Outputting a voice input prompt message to prompt the user to perform voice input;
    接收所述用户输入的待识别语音信号。Receiving a voice signal to be recognized input by the user.
  5. 根据权利要求4所述的方法,其特征在于,在输出语音输入提示信息之前,所述方法还包括:The method according to claim 4, wherein before the outputting the voice input prompt information, the method further comprises:
    接收所述服务器返回的通知消息,所述通知消息用于指示已选择所述第一方言对应的ASR模型。Receiving a notification message returned by the server, the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  6. 根据权利要求1-3任一项所述的方法，其特征在于，在接收语音唤醒词之前，所述方法还包括：The method according to any one of claims 1-3, wherein before receiving the voice wake-up word, the method further comprises:
    响应于唤醒词自定义操作,接收用户输入的自定义语音信号;Receiving a custom voice signal input by the user in response to the wake-up word custom operation;
    将所述自定义语音信号保存为所述语音唤醒词。The custom speech signal is saved as the speech wake-up word.
  7. 一种语音识别方法，适用于服务器，其特征在于，所述方法包括：A voice recognition method, applicable to a server, wherein the method comprises:
    接收终端设备发送的服务请求,所述服务请求指示选择第一方言对应的ASR模型;Receiving a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is selected;
    从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型,所述第一方言是所述语音唤醒词所属的方言;Selecting, in the ASR model corresponding to different dialects, an ASR model corresponding to the first dialect, where the first dialect is a dialect to which the voice wake-up word belongs;
    接收所述终端设备发送的待识别语音信号,并利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。And receiving the to-be-identified voice signal sent by the terminal device, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  8. 根据权利要求7所述的方法,其特征在于,在从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型之前,所述方法还包括:The method according to claim 7, wherein before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes:
    收集不同方言的语料;Collect corpora of different dialects;
    对所述不同方言的语料进行特征提取，以得到不同方言的声学特征；performing feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects;
    根据所述不同方言的声学特征,构建不同方言对应的ASR模型。According to the acoustic characteristics of the different dialects, the ASR models corresponding to different dialects are constructed.
  9. 一种语音识别方法，适用于终端设备，其特征在于，所述方法包括：A voice recognition method, applicable to a terminal device, wherein the method comprises:
    接收语音唤醒词;Receiving a speech wake-up word;
    向服务器发送所述语音唤醒词,以供服务器基于所述语音唤醒词从不同方言对应的ASR模型中选择所述语音唤醒词所属第一方言对应的ASR模型;Sending the voice wake-up word to the server, for the server to select an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects based on the voice wake-up words;
    向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。Sending a to-be-identified voice signal to the server, so that the server performs voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  10. 一种语音识别方法，适用于服务器，其特征在于，所述方法包括：A voice recognition method, applicable to a server, wherein the method comprises:
    接收终端设备发送的语音唤醒词;Receiving a voice wake-up word sent by the terminal device;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型;Selecting an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    接收所述终端设备发送的待识别语音信号,并利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。And receiving the to-be-identified voice signal sent by the terminal device, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  11. 一种语音识别方法,其特征在于,包括:A speech recognition method, comprising:
    接收语音唤醒词;Receiving a speech wake-up word;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Selecting an ASR model corresponding to the first dialect from an ASR model corresponding to different dialects;
    利用所述第一方言对应的ASR模型对待识别语音信号进行语音识别。The ASR model corresponding to the first dialect is used for speech recognition of the speech signal to be recognized.
  12. 一种语音识别方法，适用于终端设备，其特征在于，所述方法包括：A voice recognition method, applicable to a terminal device, wherein the method comprises:
    接收语音唤醒词,以唤醒语音识别功能;Receiving a voice wake-up word to wake up the voice recognition function;
    接收用户输入的具有方言指示意义的第一语音信号；receiving a first voice signal input by the user that indicates a dialect;
    从所述第一语音信号中解析出需要进行语音识别的第一方言；parsing, from the first voice signal, a first dialect in which voice recognition is to be performed;
    向服务器发送服务请求,以请求所述服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Sending a service request to the server, requesting the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。Sending a to-be-identified voice signal to the server, so that the server performs voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  13. 根据权利要求12所述的方法,其特征在于,在向服务器发送服务请求之前,所述方法还包括:The method of claim 12, wherein before the sending of the service request to the server, the method further comprises:
    若未能从所述第一语音信号中解析出所述第一方言,识别所述语音唤醒词所属的方言作为所述第一方言。If the first dialect is not parsed from the first voice signal, the dialect to which the voice wake-up word belongs is identified as the first dialect.
  14. 根据权利要求12或13所述的方法,其特征在于,所述从所述第一语音信号中解析出需要进行语音识别的第一方言,包括:The method according to claim 12 or 13, wherein the parsing the first dialect that needs speech recognition from the first speech signal comprises:
    基于声学模型将所述第一语音信号转换为第一音素序列;Converting the first speech signal into a first phoneme sequence based on an acoustic model;
    将不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配;Matching phoneme segments corresponding to different dialect names in the first phoneme sequence;
    当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
  15. 一种终端设备,其特征在于,包括:存储器、处理器以及通信组件;A terminal device, comprising: a memory, a processor, and a communication component;
    所述存储器,用于存储计算机程序;The memory for storing a computer program;
    所述处理器,与所述存储器耦合,用于执行所述计算机程序,以用于:The processor, coupled to the memory, for executing the computer program for:
    通过所述通信组件接收语音唤醒词;Receiving a voice wake-up word through the communication component;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    通过所述通信组件向服务器发送服务请求,以请求所述服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Sending, by the communication component, a service request to the server, to request the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    通过所述通信组件向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别;Sending, by the communication component, the to-be-identified voice signal to the server, for the server to perform voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect;
    所述通信组件,用于接收所述语音唤醒词,向所述服务器发送所述服务请求以及所述待识别语音信号。The communication component is configured to receive the voice wake-up word, and send the service request and the to-be-identified voice signal to the server.
  16. 一种服务器,其特征在于,包括:存储器、处理器以及通信组件;A server, comprising: a memory, a processor, and a communication component;
    所述存储器,用于存储计算机程序;The memory for storing a computer program;
    所述处理器,与所述存储器耦合,用于执行所述计算机程序,以用于:The processor, coupled to the memory, for executing the computer program for:
    通过所述通信组件接收终端设备发送的服务请求,所述服务请求指示选择第一方言对应的ASR模型;Receiving, by the communication component, a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is selected;
    从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型,所述第一方言是所述终端设备接收的语音唤醒词所属的方言;Selecting, from the ASR model corresponding to the different dialects, the ASR model corresponding to the first dialect, where the first dialect is a dialect to which the voice wake-up word received by the terminal device belongs;
    通过所述通信组件接收所述终端设备发送的待识别语音信号,并利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别;Receiving, by the communication component, the to-be-identified voice signal sent by the terminal device, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect;
    所述通信组件,用于接收所述服务请求和所述待识别语音信号。The communication component is configured to receive the service request and the to-be-identified voice signal.
  17. 一种终端设备,其特征在于,包括:存储器、处理器以及通信组件;A terminal device, comprising: a memory, a processor, and a communication component;
    所述存储器,用于存储计算机程序;The memory for storing a computer program;
    所述处理器,与所述存储器耦合,用于执行所述计算机程序,以用于:The processor, coupled to the memory, for executing the computer program for:
    通过所述通信组件接收语音唤醒词;Receiving a voice wake-up word through the communication component;
    通过所述通信组件向服务器发送所述语音唤醒词,以供服务器基于所述语音唤醒词从不同方言对应的ASR模型中选择所述语音唤醒词所属第一方言对应的ASR模型;Sending, by the communication component, the voice wake-up words to the server, so that the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects based on the voice wake-up words;
    通过所述通信组件向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别;Sending, by the communication component, the to-be-identified voice signal to the server, for the server to perform voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect;
    所述通信组件,用于接收所述语音唤醒词,向所述服务器发送所述语音唤醒词和所述待识别语音信号。The communication component is configured to receive the voice wake-up word, and send the voice wake-up word and the to-be-identified voice signal to the server.
  18. A server, comprising: a memory, a processor, and a communication component;
    the memory being configured to store a computer program;
    the processor, coupled to the memory, being configured to execute the computer program so as to:
    receive, through the communication component, a voice wake-up word sent by a terminal device;
    identify the first dialect to which the voice wake-up word belongs;
    select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
    receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
    the communication component being configured to receive the voice wake-up word and the speech signal to be recognized.
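The server-side flow of claim 18 — identify the dialect of the wake-up word, select the matching ASR model from a per-dialect pool, then recognize subsequent speech with that model — can be sketched as follows. This is a minimal illustration only: the `DialectRoutingServer` class, the classifier callable, and the `recognize` model interface are hypothetical stand-ins, not defined by the patent or by any particular library.

```python
# Illustrative sketch of the claim-18 server flow (all names hypothetical).

class DialectRoutingServer:
    """Routes speech to the ASR model of the dialect detected in the
    voice wake-up word, as described in claim 18."""

    def __init__(self, dialect_classifier, asr_models):
        # dialect_classifier: wake-word audio -> dialect name.
        # asr_models: mapping from dialect name to a model object that
        # exposes a `recognize(audio) -> str` method (assumed interface).
        self.dialect_classifier = dialect_classifier
        self.asr_models = asr_models
        self.selected_model = None

    def on_wake_word(self, wake_word_audio):
        # Identify the first dialect to which the wake-up word belongs,
        # then select the matching ASR model from the per-dialect pool.
        dialect = self.dialect_classifier(wake_word_audio)
        if dialect not in self.asr_models:
            raise KeyError(f"no ASR model registered for dialect {dialect!r}")
        self.selected_model = self.asr_models[dialect]
        return dialect

    def on_speech(self, audio):
        # Recognize subsequent speech with the previously selected model.
        if self.selected_model is None:
            raise RuntimeError("wake-up word must be processed first")
        return self.selected_model.recognize(audio)
```

In this sketch the model selected at wake-up time is reused for every later speech signal from the same session, which mirrors the claim's two-step receive order (wake-up word first, speech signal second).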
  19. An electronic device, comprising: a memory, a processor, and a communication component;
    the memory being configured to store a computer program;
    the processor, coupled to the memory, being configured to execute the computer program so as to:
    receive a voice wake-up word through the communication component;
    identify the first dialect to which the voice wake-up word belongs;
    select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
    perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
    the communication component being configured to receive the voice wake-up word.
  20. A terminal device, comprising: a memory, a processor, and a communication component;
    the memory being configured to store a computer program;
    the processor, coupled to the memory, being configured to execute the computer program so as to:
    receive a voice wake-up word through the communication component, so as to wake up the speech recognition function;
    receive, through the communication component, a dialect-indicating first voice signal input by a user;
    parse, from the first voice signal, the first dialect in which speech recognition is to be performed;
    send a service request to a server through the communication component, requesting the server to select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
    send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
    the communication component being configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
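The terminal-side steps of claim 20 — extract the desired dialect from a dialect-indicating utterance and build the service request that asks the server to select the matching ASR model — can be sketched as below. The transcription step is assumed to have already happened; the dialect list, the keyword-matching heuristic, and the request format are all assumptions for illustration, not part of the patent.

```python
# Hypothetical sketch of the claim-20 terminal-side flow.
from typing import Optional

# Assumed inventory of dialects the server has ASR models for.
KNOWN_DIALECTS = {"cantonese", "sichuanese", "shanghainese"}

def parse_dialect(transcript: str) -> Optional[str]:
    """Extract the requested dialect from a dialect-indicating utterance,
    e.g. 'please switch to cantonese'. Returns None if no known dialect
    is mentioned."""
    for word in transcript.lower().split():
        if word in KNOWN_DIALECTS:
            return word
    return None

def build_service_request(transcript: str) -> dict:
    """Build the service request that asks the server to select the ASR
    model corresponding to the first dialect (request shape assumed)."""
    dialect = parse_dialect(transcript)
    if dialect is None:
        raise ValueError("no known dialect found in utterance")
    return {"action": "select_asr_model", "dialect": dialect}
```

A real terminal would derive the transcript from the first voice signal with an on-device recognizer; simple keyword matching stands in for that parsing step here.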
  21. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer, implements the steps of the method according to any one of claims 1-6.
  22. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer, implements the steps of the method according to any one of claims 7-8.
  23. A speech recognition system, comprising a server and a terminal device;
    the terminal device being configured to receive a voice wake-up word, identify the first dialect to which the voice wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, the service request indicating that the ASR model corresponding to the first dialect is to be selected;
    the server being configured to receive the service request, select, according to the indication of the service request, the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
  24. A speech recognition system, comprising a server and a terminal device;
    the terminal device being configured to receive a voice wake-up word, send the voice wake-up word to the server, and send a speech signal to be recognized to the server;
    the server being configured to receive the voice wake-up word, identify the first dialect to which the voice wake-up word belongs, select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
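The end-to-end exchange of the claim-24 system — terminal forwards the wake-up word, server identifies the dialect and selects a model, terminal then forwards speech for recognition — can be simulated in-process as follows. The message format, the classifier, and the per-dialect models are all hypothetical stand-ins; a real system would carry these messages over a network transport.

```python
# Minimal in-process simulation of the claim-24 terminal/server exchange
# (all component names and the message format are assumptions).

class Server:
    def __init__(self, classify, models):
        self.classify = classify  # wake-word audio -> dialect name
        self.models = models      # dialect name -> recognize function
        self.model = None

    def handle(self, message):
        kind, payload = message
        if kind == "wake_word":
            # Identify the dialect of the wake-up word and select its model.
            self.model = self.models[self.classify(payload)]
            return "model_selected"
        if kind == "speech":
            # Recognize the speech signal with the dialect-specific model.
            return self.model(payload)
        raise ValueError(f"unknown message kind {kind!r}")

class Terminal:
    def __init__(self, server):
        self.server = server

    def speak(self, wake_word_audio, speech_audio):
        # Send the wake-up word first, then the speech signal to be
        # recognized, matching the order the claim describes.
        self.server.handle(("wake_word", wake_word_audio))
        return self.server.handle(("speech", speech_audio))
```

This contrasts with the claim-23 variant, where the terminal (not the server) identifies the dialect and sends only a service request naming it.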
PCT/CN2018/114531 2017-11-17 2018-11-08 Speech recognition method, device and system WO2019096056A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711147698.X 2017-11-17
CN201711147698.XA CN109817220A (en) 2017-11-17 2017-11-17 Audio recognition method, apparatus and system

Publications (1)

Publication Number Publication Date
WO2019096056A1 true WO2019096056A1 (en) 2019-05-23

Family

ID=66539363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114531 WO2019096056A1 (en) 2017-11-17 2018-11-08 Speech recognition method, device and system

Country Status (3)

Country Link
CN (1) CN109817220A (en)
TW (1) TW201923736A (en)
WO (1) WO2019096056A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102819A (en) * 2019-05-29 2020-12-18 Nanning Fugui Precision Industrial Co., Ltd. Voice recognition device and method for switching recognition languages thereof
CN112116909A (en) * 2019-06-20 2020-12-22 Hangzhou Hikvision Digital Technology Co., Ltd. Voice recognition method, device and system
CN110364147B (en) * 2019-08-29 2021-08-20 Xiamen Sixinwei Technology Co., Ltd. Awakening training word acquisition system and method
CN111091809B (en) * 2019-10-31 2023-05-23 National Computer Network and Information Security Administration Center Regional accent recognition method and device based on depth feature fusion
CN110853643A (en) * 2019-11-18 2020-02-28 Beijing Xiaomi Mobile Software Co., Ltd. Method, device, equipment and storage medium for voice recognition in fast application
CN110827799B (en) * 2019-11-21 2022-06-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and medium for processing voice signal
CN111081217B (en) * 2019-12-03 2021-06-04 Gree Electric Appliances, Inc. of Zhuhai Voice wake-up method and device, electronic equipment and storage medium
CN111128125A (en) * 2019-12-30 2020-05-08 Shenzhen UBTECH Robotics Co., Ltd. Voice service configuration system and voice service configuration method and device thereof
CN111724766B (en) * 2020-06-29 2024-01-05 Hefei Xunfei Digital Technology Co., Ltd. Language identification method, related equipment and readable storage medium
CN112820296B (en) * 2021-01-06 2022-05-20 Beijing SoundAI Technology Co., Ltd. Data transmission method and electronic equipment
CN113506565A (en) * 2021-07-12 2021-10-15 Beijing Sinovoice Technology Co., Ltd. Speech recognition method, speech recognition device, computer-readable storage medium and processor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2814300A1 (en) * 2012-04-30 2013-10-30 Qnx Software Systems Limited Post processing of natural language ASR
CN105223851A (en) * 2015-10-09 2016-01-06 Hanshan Normal University Intelligent socket system and control method based on accent recognition
CN105957527A (en) * 2016-05-16 2016-09-21 Gree Electric Appliances, Inc. of Zhuhai Electric appliance speech control method and device and speech control air-conditioner
CN106452997A (en) * 2016-09-30 2017-02-22 Wuxi Little Swan Co., Ltd. Household electrical appliance and control system thereof
CN106997762A (en) * 2017-03-08 2017-08-01 GD Midea Air-Conditioning Equipment Co., Ltd. Voice control method and device for household electrical appliance
CN107134279A (en) * 2017-06-30 2017-09-05 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wake-up method, device, terminal and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
CN104036774B (en) * 2014-06-20 2018-03-06 National Computer Network and Information Security Administration Center Tibetan dialect recognition method and system
CN104575504A (en) * 2014-12-24 2015-04-29 Shanghai Normal University Method for personalized television voice wake-up by voiceprint and voice identification
CN105654943A (en) * 2015-10-26 2016-06-08 Leshi Zhixin Electronic Technology (Tianjin) Co., Ltd. Voice wakeup method, apparatus and system thereof
CN106653031A (en) * 2016-10-17 2017-05-10 Hisense Group Co., Ltd. Voice wake-up method and voice interaction device


Also Published As

Publication number Publication date
TW201923736A (en) 2019-06-16
CN109817220A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
WO2019096056A1 (en) Speech recognition method, device and system
US11132172B1 (en) Low latency audio data pipeline
US20230333688A1 (en) Systems and Methods for Identifying a Set of Characters in a Media File
US11024307B2 (en) Method and apparatus to provide comprehensive smart assistant services
US11915699B2 (en) Account association with device
CN110140168B (en) Contextual hotwords
US10977299B2 (en) Systems and methods for consolidating recorded content
KR102196400B1 (en) Determining hotword suitability
CN110825340B (en) Providing a pre-computed hotword model
KR101689290B1 (en) Device for extracting information from a dialog
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
CN110494841B (en) Contextual language translation
US11893350B2 (en) Detecting continuing conversations with computing devices
US10699706B1 (en) Systems and methods for device communications
KR102628211B1 (en) Electronic apparatus and thereof control method
WO2019045816A1 (en) Graphical data selection and presentation of digital content
WO2020052135A1 (en) Music recommendation method and apparatus, computing apparatus, and storage medium
CN112262432A (en) Voice processing device, voice processing method, and recording medium
KR20200082137A (en) Electronic apparatus and controlling method thereof
US20180350360A1 (en) Provide non-obtrusive output
JP6347939B2 (en) Utterance key word extraction device, key word extraction system using the device, method and program thereof
KR102584324B1 (en) Method for providing of voice recognition service and apparatus thereof
US11481188B1 (en) Application launch delay and notification
KR20210098250A (en) Electronic device and Method for controlling the electronic device thereof
US9251782B2 (en) System and method for concatenate speech samples within an optimal crossing point

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18879656; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18879656; Country of ref document: EP; Kind code of ref document: A1)