WO2019096056A1 - Speech recognition method, device and system - Google Patents

Speech recognition method, device and system

Info

Publication number
WO2019096056A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialect
voice
wake
word
server
Prior art date
Application number
PCT/CN2018/114531
Other languages
French (fr)
Chinese (zh)
Inventor
牛也
徐巍越
冯伟国
黄光远
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2019096056A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present application relates to the field of voice recognition technologies, and in particular, to a voice recognition method, apparatus, and system.
  • ASR: Automatic Speech Recognition
  • aspects of the present application provide a speech recognition method, apparatus, and system for automatically performing speech recognition on a plurality of dialects and improving the efficiency of speech recognition for the plurality of dialects.
  • the embodiment of the present application provides a voice recognition method, which is applicable to a terminal device, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a server, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a terminal device, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a server, and the method includes:
  • the embodiment of the present application further provides a voice recognition method, including:
  • the ASR model corresponding to the first dialect is used for speech recognition of the speech signal to be recognized.
  • the embodiment of the present application further provides a voice recognition method, which is applicable to a terminal device, and the method includes:
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word, and send the service request and the to-be-identified voice signal to the server.
  • the embodiment of the present application further provides a server, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the service request and the to-be-identified voice signal.
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • sending, by the communication component, the voice wake-up word to the server, so that the server selects, based on the voice wake-up word, an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects;
  • the communication component is configured to receive the voice wake-up word, and send the voice wake-up word and the to-be-identified voice signal to the server.
  • the embodiment of the present application further provides a server, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word and the to-be-identified voice signal.
  • the embodiment of the present application further provides an electronic device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word.
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component;
  • the memory for storing a computer program
  • the processor coupled to the memory, for executing the computer program for:
  • the communication component is configured to receive the voice wake-up word and the first voice signal, and send the service request and the to-be-identified voice signal to the server.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program; when the computer program is executed by a computer, the steps in the foregoing first voice recognition method embodiment can be implemented.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program; when the computer program is executed by a computer, the steps in the foregoing second voice recognition method embodiment can be implemented.
  • the embodiment of the present application further provides a voice recognition system, including a server and a terminal device;
  • the terminal device is configured to receive a voice wake-up word, identify a first dialect to which the voice wake-up word belongs, send a service request to the server, and send a to-be-identified voice signal to the server, where the service request indicates selection of the ASR model corresponding to the first dialect;
  • the server is configured to receive the service request, select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects according to the indication of the service request, receive the to-be-identified voice signal, and perform speech recognition on the to-be-identified voice signal by using the ASR model corresponding to the first dialect.
  • the embodiment of the present application further provides a voice recognition system, including a server and a terminal device;
  • the terminal device is configured to receive a voice wake-up word, send the voice wake-up word to the server, and send the to-be-identified voice signal to the server;
  • the server is configured to receive the voice wake-up word, identify a first dialect to which the voice wake-up word belongs, select an ASR model corresponding to the first dialect from an ASR model corresponding to different dialects, and receive the waiting Identifying a voice signal, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  • the ASR model is constructed for different dialects, and in the speech recognition process, the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent speech signals to be recognized, so that multi-dialect speech recognition is automated: the ASR model of the corresponding dialect is automatically selected based on the voice wake-up word, which is more convenient and faster to implement because no manual operation by the user is required, and is beneficial to improving the efficiency of multi-dialect speech recognition.
  • FIG. 1 is a schematic structural diagram of a voice recognition system according to an exemplary embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice recognition method according to another exemplary embodiment of the present application.
  • FIG. 3 is a schematic flowchart diagram of another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of another voice recognition system according to still another exemplary embodiment of the present application.
  • FIG. 5 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 6 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 7 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a module of a voice recognition apparatus according to another exemplary embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a terminal device according to another exemplary embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another voice recognition apparatus according to another exemplary embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of a server according to still another exemplary embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of still another module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of still another terminal device according to another exemplary embodiment of the present disclosure.
  • FIG. 14 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of another server according to another exemplary embodiment of the present disclosure.
  • FIG. 16 is a schematic structural diagram of still another module of a voice recognition apparatus according to still another exemplary embodiment of the present disclosure.
  • FIG. 17 is a schematic structural diagram of an electronic device according to still another exemplary embodiment of the present application.
  • at present, speech recognition schemes for dialects are not mature.
  • the embodiment of the present application provides a solution, the main idea of which is to construct an ASR model for each of the different dialects. In the process of speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized speech signal, thereby automating multi-dialect speech recognition.
  • the ASR model of the corresponding dialect is automatically selected based on the voice wake-up word, which is more convenient and faster to implement because no manual operation by the user is required, and is beneficial to improving the efficiency of multi-dialect speech recognition.
  • FIG. 1 is a schematic structural diagram of a voice recognition system according to an exemplary embodiment of the present application.
  • the speech recognition system 100 includes a server 101 and a terminal device 102.
  • a communication connection is made between the server 101 and the terminal device 102.
  • the terminal device 102 can communicate with the server 101 via the Internet, or via a mobile network. If the terminal device 102 is in communication connection with the server 101 through a mobile network, the network standard of the mobile network may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, and the like.
  • the server 101 mainly provides an ASR model for different dialects, and selects a corresponding ASR model to perform speech recognition on the speech signals in the corresponding dialects.
  • the server 101 can be any device that can provide computing services, can respond to service requests, and process, such as a conventional server, a cloud server, a cloud host, a virtual center, and the like.
  • the composition of the server mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general-purpose computer architecture.
  • the terminal device 102 is mainly oriented to the user, and may provide an interface or portal for voice recognition to the user.
  • the terminal device 102 can be implemented in various forms, such as a smart phone, a smart speaker, a personal computer, a wearable device, a tablet computer, and the like.
  • the terminal device 102 typically includes at least one processing unit and at least one memory. The number of processing units and memories depends on the configuration and type of terminal device 102.
  • the memory may include volatile memory, such as RAM, and may also include non-volatile memory, such as read-only memory (ROM) or flash memory, or both.
  • An operating system (OS), one or more applications, and program data are stored in the memory.
  • the terminal device 102 also includes some basic configurations, such as a network card chip, an IO bus, an audio and video component (such as a microphone), and the like.
  • the terminal device 102 may also include some peripheral devices such as a keyboard, a mouse, a stylus, a printer, and the like. These peripheral devices are well known in the art and will not be described herein.
  • the terminal device 102 and the server 101 cooperate with each other to provide a voice recognition function to the user.
  • the terminal device 102 may be used by multiple users, and multiple users may hold different dialects.
  • by geographical division, dialects can include the following types: Mandarin dialects, Jin dialect, Xiang dialect, Gan dialect, Wu dialect, Min dialect, Cantonese, and Hakka.
  • some dialects can also be subdivided.
  • for example, the Min dialects can include the Minbei (Northern Min) dialect, Minnan (Southern Min) dialect, Mindong (Eastern Min) dialect, Minzhong (Central Min) dialect, and Puxian dialect. The pronunciations of different dialects differ considerably, and the same ASR model cannot be used for speech recognition across all of them.
  • the ASR models are separately constructed for different dialects in order to perform speech recognition on different dialects. Further, based on the cooperation between the terminal device 102 and the server 101, a voice recognition function can be provided to users holding different dialects, that is, voice recognition can be performed on voice signals of users holding different dialects.
  • the terminal device 102 supports the voice wake-up word function, that is, when the user wants to perform voice recognition, the voice wake-up word can be input to the terminal device 102 to wake up the voice recognition function.
  • the voice wake-up word is a voice signal with specified text content, and may be, for example, "on", "Tmall Genie", "hello", and the like.
  • the terminal device 102 receives the voice wake-up word input by the user, identifies the dialect to which the voice wake-up word belongs, and thereby determines the dialect to which the subsequent voice signal to be recognized belongs (i.e., the dialect to which the voice wake-up word belongs), so that the ASR model corresponding to that dialect can be used.
  • for ease of description, the dialect to which the voice wake-up word belongs is denoted as the first dialect.
  • the first dialect to which the voice wake-up word belongs may be any dialect in any language.
  • the terminal device 102 may send a service request to the server 101, the service request instructing the server 101 to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the server 101 receives the service request sent by the terminal device 102, and then selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects according to the indication of the service request, so that the subsequent to-be-recognized speech signal can be recognized based on the ASR model corresponding to the first dialect.
  • the server 101 stores in advance an ASR model corresponding to different dialects.
  • the ASR model is a model that converts speech signals into text.
  • one dialect may correspond to one ASR model, or several similar dialects may correspond to the same ASR model; this is not limited here.
  • the ASR model corresponding to the first dialect is used to convert the voice signal of the first dialect into text content.
  • After transmitting the service request to the server 101, the terminal device 102 continues to send the to-be-identified voice signal, which belongs to the first dialect, to the server 101.
  • the server 101 receives the to-be-recognized speech signal sent by the terminal device 102 and performs speech recognition on it according to the selected ASR model corresponding to the first dialect. Not only can speech in the first dialect thus be recognized, but adopting the matching ASR model also helps improve the accuracy of speech recognition.
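The server-side flow described above (receive a service request naming the first dialect, pick that dialect's ASR model, run recognition) can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the `ASRModel` class, `handle_service_request` function, and the dialect names are all hypothetical placeholders.

```python
class ASRModel:
    """Hypothetical stand-in for a dialect-specific speech-to-text model."""
    def __init__(self, dialect):
        self.dialect = dialect

    def recognize(self, speech_signal):
        # A real model would decode audio; here we just tag the result.
        return f"[{self.dialect} transcription of {speech_signal!r}]"

# One ASR model per dialect (several similar dialects could share a model).
ASR_MODELS = {d: ASRModel(d) for d in ("Mandarin", "Cantonese", "Hakka", "Min")}

def handle_service_request(first_dialect, speech_signal):
    """Select the ASR model the service request indicates, then perform
    speech recognition on the to-be-recognized speech signal."""
    model = ASR_MODELS.get(first_dialect)
    if model is None:
        raise ValueError(f"no ASR model for dialect {first_dialect!r}")
    return model.recognize(speech_signal)

print(handle_service_request("Cantonese", "play a song"))
```

The key design point mirrored here is that model selection happens once, up front, from the dialect identified via the wake-up word; every subsequent signal in the session reuses the selected model.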
  • the to-be-identified voice signal may be a voice signal that the user continues to input to the terminal device 102 after inputting the voice wake-up word. Based on this, before transmitting the to-be-identified voice signal to the server 101, the terminal device 102 may further receive the voice signal input by the user.
  • the speech signal to be recognized may also be a voice signal pre-recorded and stored locally on the terminal device 102, in which case the terminal device 102 may acquire the voice signal to be recognized directly from local storage.
  • the server 101 may return the associated information of the speech recognition result or the speech recognition result to the terminal device 102.
  • the server 101 may return the text content recognized from the voice to the terminal device 102; or the server 101 may return information such as songs and videos that match the voice recognition result to the terminal device 102.
  • the terminal device 102 receives the speech recognition result returned by the server 101 or the association information of the speech recognition result, and performs subsequent processing based on the speech recognition result or the association information of the speech recognition result.
  • the terminal device 102 may present the text content to the user, or may perform a network search or the like based on the text content.
  • the terminal device 102 can play information such as songs and videos, or can forward such information to other users for information sharing.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent speech signals to be recognized, so that multi-dialect speech recognition is automated: the ASR model of the corresponding dialect is automatically selected based on the voice wake-up word, which is more convenient and quicker to implement because no manual operation by the user is required, and is conducive to improving the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs takes relatively little time, so the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
  • the manner in which the terminal device 102 recognizes the first dialect to which the voice wake-up word belongs is not limited, and any manner in which the first dialect to which the voice wake-up word belongs can be applied to the embodiments of the present application.
  • several ways in which the terminal device 102 may recognize the dialect to which the voice wake-up word belongs are listed below:
  • Mode 1: the terminal device 102 dynamically matches the voice wake-up word against the reference wake-up words recorded in different dialects, and takes as the first dialect the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets the first setting requirement.
  • the reference wake-up words are recorded in advance in different dialects.
  • the reference wake-up words recorded in different dialects have the same text content as the voice wake-up word. Because users of different dialects have different vocalization mechanisms, the acoustic characteristics of the reference wake-up words recorded in different dialects differ.
  • the terminal device 102 pre-records the reference wake-up words in different dialects. After receiving the voice wake-up word input by the user, it dynamically matches the voice wake-up word against the reference wake-up words recorded in the different dialects to obtain the degree of matching with each reference wake-up word.
  • the first setting requirement may be different according to different application scenarios.
  • for example, the dialect corresponding to the reference wake-up word with the highest degree of matching with the voice wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect corresponding to a reference wake-up word whose matching degree with the voice wake-up word is greater than the threshold is taken as the first dialect; or a matching-degree range may be set, and the dialect corresponding to a reference wake-up word whose matching degree with the voice wake-up word falls within that range is taken as the first dialect.
  • the acoustic features may be embodied as time domain features and frequency domain features of the speech signal.
  • dynamic matching of the voice wake-up word can be performed based on the dynamic time warping (DTW) method.
  • dynamic time warping is a method of measuring the similarity between two time series.
  • the terminal device 102 generates a time series for the voice wake-up word from the input signal and compares it with the time series of the reference wake-up words recorded in different dialects; at least one pair of similar points is determined between the two time series being compared.
  • the similarity between two time series is measured by the sum of the distances between these similar points, that is, the warping path distance.
  • for example, the dialect corresponding to the reference wake-up word with the smallest warping path distance from the voice wake-up word may be taken as the first dialect; or a distance threshold may be set, and the dialect corresponding to a reference wake-up word whose warping path distance from the voice wake-up word is less than the threshold is taken as the first dialect; or a distance range may be set, and the dialect corresponding to a reference wake-up word whose warping path distance from the voice wake-up word falls within that range is taken as the first dialect.
  • Mode 2: the terminal device 102 recognizes the acoustic features of the voice wake-up word, matches them against the acoustic features of different dialects, and takes as the first dialect the dialect whose acoustic features match those of the voice wake-up word to a degree that meets the second setting requirement.
  • in this mode, the acoustic features of different dialects are acquired in advance, the acoustic features of the voice wake-up word are recognized, and the first dialect to which the voice wake-up word belongs is then determined based on the matching between these acoustic features.
  • the speech wake words may be filtered and digitized prior to identifying the acoustic features of the speech wake words.
  • the filtering process refers to preserving the components of the voice wake-up word signal whose frequencies lie between 300 and 3400 Hz.
  • digitization refers to A/D conversion and anti-aliasing processing of the preserved signal.
  • the acoustic features of the voice wake-up word may be identified by calculating spectral feature parameters of the voice wake-up word, such as sliding differential cepstral parameters. Similar to mode 1, the second setting requirement may differ depending on the application scenario. For example, the dialect corresponding to the reference wake-up word whose acoustic features have the highest degree of matching with those of the voice wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect corresponding to a reference wake-up word whose acoustic-feature matching degree with the voice wake-up word is greater than the threshold is taken as the first dialect; or a matching-degree range may be set, and the dialect corresponding to a reference wake-up word whose acoustic-feature matching degree with the voice wake-up word falls within that range is taken as the first dialect.
  • the sliding differential cepstrum parameter is composed of several blocks of differential cepstra spanning multiple frames of speech; because it takes the differential cepstra of preceding and following frames into account, it incorporates more timing information. The sliding differential cepstral parameters of the voice wake-up word are compared with those of the reference wake-up words recorded in different dialects. Optionally, the dialect corresponding to the reference wake-up word whose sliding differential cepstral parameters match those of the voice wake-up word most closely is taken as the first dialect; or a parameter-difference threshold may be set, and the dialect corresponding to a reference wake-up word whose parameter difference from the voice wake-up word is less than the threshold is taken as the first dialect; or a parameter-difference range may be set, and the dialect corresponding to a reference wake-up word whose parameter difference from the voice wake-up word falls within that range is taken as the first dialect.
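A heavily simplified sketch of mode 2's comparison: `delta_features` below computes plain first-order frame differentials as a stand-in for the sliding differential cepstrum (which stacks several such differentials across frames), and the smallest parameter difference selects the dialect. The two-dimensional "cepstral" vectors and dialect names are invented for illustration, not taken from the patent.

```python
import math

def delta_features(frames, d=1):
    """First-order frame-to-frame differentials: a simplified stand-in for
    sliding differential cepstral (SDC) parameters."""
    return [
        [frames[t + d][k] - frames[t - d][k] for k in range(len(frames[t]))]
        for t in range(d, len(frames) - d)
    ]

def feature_distance(x, y):
    """Euclidean distance between two equal-length delta-feature sequences."""
    return math.sqrt(sum(
        (a - b) ** 2 for fx, fy in zip(x, y) for a, b in zip(fx, fy)
    ))

# Toy per-frame "cepstral" vectors for each dialect's reference wake-up word.
refs = {
    "dialect_a": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    "dialect_b": [[3.0, 3.0], [2.0, 2.0], [1.0, 1.0], [0.0, 0.0]],
}
wake_word = [[0.1, 0.0], [1.0, 1.1], [2.0, 2.0], [2.9, 3.0]]  # rising trajectory

wake_delta = delta_features(wake_word)
# Smallest parameter difference wins (one of the selection rules above).
first_dialect = min(
    refs, key=lambda d: feature_distance(wake_delta, delta_features(refs[d]))
)
print(first_dialect)
```

The differential features capture how the spectrum changes over time, which is the timing information the SDC description above refers to.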
  • Mode 3: the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect corresponding to the reference text wake-up word whose matching degree meets the third setting requirement is taken as the first dialect.
  • the text wake-up word is the text obtained by performing speech recognition on the voice wake-up word.
  • the reference text wake-up words corresponding to different dialects are the texts obtained by converting the reference wake-up speech of the corresponding dialects.
  • the same speech recognition model may be used for this rough speech recognition across dialects, to improve the efficiency of the entire speech recognition process.
  • the ASR models corresponding to different dialects may be used in advance to perform voice recognition on the reference wake-up words of the corresponding dialects and convert them into the corresponding reference text wake-up words. After the voice wake-up word is received, the ASR model corresponding to each dialect may be selected in turn to convert the voice wake-up word into a text wake-up word, which is matched against the reference text wake-up words; the dialect whose reference text wake-up word matches is taken as the first dialect to which the voice wake-up word belongs.
  • for example, the dialect corresponding to the reference text wake-up word with the highest degree of matching with the text wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect corresponding to a reference text wake-up word whose matching degree with the text wake-up word is greater than the threshold is taken as the first dialect; or a matching-degree range may be set, and the dialect corresponding to a reference text wake-up word whose matching degree with the text wake-up word falls within that range is taken as the first dialect.
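Mode 3's text-level matching might be sketched as follows. The romanized wake-word strings and dialect names are invented for illustration, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever text-matching measure an implementation would actually use; the "highest matching degree" rule is applied.

```python
import difflib

# Reference text wake-up words per dialect: the text each dialect's recorded
# reference wake-up word decodes to. These strings are illustrative only.
reference_texts = {
    "dialect_a": "ni hao tian mao",
    "dialect_b": "nei hou tin maau",
}

def identify_dialect(text_wake_word):
    """Take as the first dialect the dialect whose reference text wake-up
    word has the highest similarity to the decoded text wake-up word."""
    return max(
        reference_texts,
        key=lambda d: difflib.SequenceMatcher(
            None, text_wake_word, reference_texts[d]
        ).ratio(),
    )

print(identify_dialect("ni hao tian mao"))
```

Because a coarse recognizer transcribes dialect-specific pronunciations into systematically different text, even an approximate string similarity can separate the dialects.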
  • the first setting requirement, the second setting requirement, and the third setting requirement may be the same as or different from one another.
  • if the terminal device 102 is a device with a display screen, such as a mobile phone, a computer, or a wearable device, it may display a voice input interface on the display screen and acquire the voice signal input by the user through the voice input interface.
  • an instruction to activate or turn on the terminal device 102 may be sent by pressing a power button of the terminal device 102 or by touching its display screen.
  • the terminal device 102 can present a voice input interface to the user on the display in response to an instruction to activate or turn on itself.
  • an icon of the microphone or text information like “Wake Up Word Input” may be displayed on the voice input interface to instruct the user to input the voice wake up word.
  • the terminal device 102 can acquire a voice wake-up word input by the user based on the voice input interface.
  • the terminal device 102 may be a device with a voice playback function, such as a mobile phone, a computer, or a smart speaker. Based on this, after the terminal device 102 sends the service request to the server 101 and before it transmits the to-be-recognized voice signal, voice input prompt information, such as "please speak" or "please order", may be played to prompt the user to make a voice input. For the user, after the voice wake-up word has been input, the voice signal to be recognized may be input to the terminal device 102 at the prompt of the voice input prompt tone.
  • the terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101.
  • the server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
• In other embodiments, the terminal device 102 may be a device having a display screen, such as a mobile phone, a computer, or a wearable device. Based on this, after the terminal device 102 sends the service request to the server 101 and before it transmits the to-be-recognized voice signal to the server 101, it may display voice input prompt information in the form of text or an icon, such as the text "speak" or a microphone icon, to prompt the user to perform voice input. For the user, after the voice wake-up word is input, the to-be-recognized voice signal may be input to the terminal device 102 at the prompt of the voice input prompt information.
  • the terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101.
  • the server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
  • the terminal device 102 may have an indicator light. Based on this, after the terminal device 102 transmits the service request to the server 101, and before transmitting the voice signal to be recognized to the server 101, the indicator light can be illuminated to prompt the user to perform voice input. For the user, after inputting the voice wake-up word, the voice signal to be recognized may be input to the terminal device 102 at the prompt of the indicator light.
  • the terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101.
  • the server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
• The terminal device 102 may simultaneously have two or all three of a voice playback function, an indicator light, and a display screen. Based on this, the terminal device 102 can output the voice input prompt information in two or all three of the following manners at the same time: audio, text or icon, and lighting the indicator light, thereby enhancing the interaction with the user.
• In some embodiments, the terminal device 102 may learn in advance that the server 101 has selected the ASR model corresponding to the first dialect, so that after the to-be-recognized voice signal input by the user is transmitted to the server 101, the server 101 can directly recognize it with the selected ASR model. Based on this, after selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 101 returns a notification message to the terminal device 102, where the notification message indicates that the ASR model corresponding to the first dialect has been selected.
• Accordingly, the terminal device 102 receives the notification message returned by the server 101 and learns from it that the server 101 has selected the ASR model corresponding to the first dialect. After receiving the notification message, the terminal device 102 may output a voice input prompt tone, output voice input prompt information, or light the indicator light to prompt the user to perform voice input.
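The service-request/notification exchange described above can be sketched as follows; the message types, field names, and dialect names are illustrative placeholders, not part of the patent:

```python
def server_handle_service_request(request, asr_models):
    """Select the ASR model for the requested dialect and acknowledge."""
    dialect = request["dialect"]
    if dialect not in asr_models:
        return {"type": "error", "reason": "unsupported dialect"}
    # The selected model would be held for the upcoming recognition session.
    return {"type": "notification", "selected_dialect": dialect}

def terminal_on_message(message):
    """Prompt the user only after the server confirms model selection."""
    if message.get("type") == "notification":
        return "please speak"  # or light the indicator, or show an icon
    return None

asr_models = {"cantonese": object(), "mandarin": object()}
reply = server_handle_service_request({"dialect": "cantonese"}, asr_models)
prompt = terminal_on_message(reply)
```

The key ordering property is that `terminal_on_message` emits no prompt until the server's notification arrives, matching the "send request, receive notification, then prompt" flow in the text.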
• Before selecting the ASR model corresponding to the first dialect, the server 101 needs to construct the ASR models corresponding to different dialects.
• The process by which the server 101 constructs the ASR models corresponding to different dialects mainly includes: collecting corpora of different dialects; performing feature extraction on the corpora of different dialects to obtain acoustic features of the different dialects; and constructing the ASR models corresponding to the different dialects according to those acoustic features.
• The corpora of different dialects may be collected over the network, or voice recordings of a large number of users who speak different dialects may be made to obtain the corpora of the different dialects.
  • the collected corpus of different dialects may be pre-processed before feature extraction of corpora of different dialects.
  • the preprocessing process includes pre-emphasis processing, windowing processing, and endpoint detection processing on the voice.
  • feature extraction can be performed on the speech.
  • Features of speech include time domain features and frequency domain features.
  • the time domain features include short-term average energy, short-term average zero-crossing rate, formant, pitch period, etc.
• the frequency domain features include linear prediction coefficients, LPC cepstral coefficients, line spectrum pair parameters, the short-term spectrum, Mel-frequency cepstral coefficients, etc.
  • the process of extracting the acoustic features will be described by taking the Mel frequency cepstrum coefficient as an example.
• Several band-pass filters are set within the spectral range of the speech, each having a triangular or sinusoidal filtering characteristic; the feature vector is then obtained from the outputs of filtering the corpus through these band-pass filters. Because the energy information of the speech is contained in these outputs, the signal energy of each band-pass filter is calculated, and the Mel-frequency cepstral coefficients are then obtained from these energies by a discrete cosine transform.
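The final step above can be sketched in a few lines: given the energies of N band-pass filters, the cepstral coefficients are the DCT-II of the log filterbank energies. The filterbank energies below are made-up toy values, not taken from a real recording:

```python
import math

def mfcc_from_filterbank(energies, num_coeffs):
    """DCT-II of the log filterbank energies -> cepstral coefficients."""
    n = len(energies)
    log_e = [math.log(e) for e in energies]
    coeffs = []
    for k in range(num_coeffs):
        c = sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / n)
                for m in range(n))
        coeffs.append(c)
    return coeffs

filter_energies = [12.0, 9.5, 7.2, 4.8, 3.1, 2.0, 1.4, 1.1]  # toy values
mfcc = mfcc_from_filterbank(filter_energies, num_coeffs=4)
```

In a full front end the energies would come from triangular mel-spaced filters applied to the power spectrum of each frame; only the log-plus-DCT stage is shown here.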
• When constructing the ASR models, the acoustic features of the different dialects are used as input and the text corresponding to the corpora of the different dialects is used as output, and the parameters of the initial models corresponding to the different dialects are trained to obtain the ASR models corresponding to the different dialects.
  • the ASR model includes, but is not limited to, a model constructed based on vector quantization, a neural network model, and the like.
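As a toy illustration of the vector-quantization family mentioned above, the sketch below represents each wake-up word class by a codebook centroid of its training feature vectors and classifies by nearest centroid; the two-dimensional "acoustic features" and labels are invented example data:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def train_vq(labeled_features):
    """labeled_features: {label: [feature vectors]} -> {label: centroid}."""
    return {label: centroid(vecs) for label, vecs in labeled_features.items()}

def classify(codebooks, feature):
    """Return the label whose centroid is nearest (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebooks, key=lambda label: dist(codebooks[label], feature))

codebooks = train_vq({
    "ni hao": [[1.0, 0.2], [1.2, 0.1]],  # toy features of one wake-up word
    "hello":  [[0.1, 1.1], [0.0, 0.9]],
})
result = classify(codebooks, [1.1, 0.0])  # nearest to the "ni hao" centroid
```

A real vector-quantization recognizer would use many centroids per class over frame-level acoustic features; this single-centroid version only shows the codebook-and-nearest-match structure.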
• The terminal device with the song-on-demand function may be a smart speaker.
  • the smart speaker has a display screen, and the preset voice wake-up word of the smart speaker is “hello”.
• The Cantonese-speaking user first touches the display screen to input an instruction to activate the smart speaker; in response to this instruction, the smart speaker displays a voice input interface on the display screen, and the text "hello" is shown on the voice input interface.
  • the Cantonese user inputs a "hello” voice signal to the voice input interface.
• The smart speaker acquires the "hello" voice signal input by the user through the voice input interface and recognizes that "hello" belongs to the Cantonese dialect; it then sends a service request to the server to request that the server select the ASR model corresponding to the Cantonese dialect from the ASR models corresponding to different dialects. After receiving the service request, the server selects the ASR model corresponding to the Cantonese dialect and returns a notification message to the smart speaker indicating that the ASR model corresponding to the Cantonese dialect has been selected. The smart speaker then outputs a voice input prompt message, such as "please input voice", to prompt the user to input voice.
  • the Cantonese user enters the voice signal of the song name “Five Star Red Flag” at the prompt of the voice input prompt message.
• The smart speaker receives the voice signal "Five Star Red Flag" input by the Cantonese user and sends it to the server.
• The server uses the ASR model corresponding to the Cantonese dialect to perform speech recognition on the voice signal "Five Star Red Flag", obtains the text information "Five Star Red Flag", and delivers the song matching "Five Star Red Flag" to the smart speaker for the smart speaker to play.
• Taking another user as an example, the user can input the "hello" voice signal on the voice input interface displayed by the smart speaker.
• The smart speaker recognizes that this "hello" belongs to the Vietnamese dialect; it then sends a service request to the server to request that the server select the ASR model corresponding to the Vietnamese dialect from the ASR models corresponding to different dialects.
  • the server selects the ASR model corresponding to the Vietnamese dialect, and returns a notification message to the smart speaker, the notification message is used to indicate that the ASR model corresponding to the Vietnamese dialect has been selected.
• The smart speaker then outputs a voice input prompt message, such as "please input voice", to prompt the user to input voice.
  • the Vietnamese user enters the voice signal of the song name "My Country” at the prompt of the voice input prompt message.
  • the smart speaker receives the voice signal "My Country” input by the user and sends the voice signal "My Country” to the server.
• The server uses the ASR model corresponding to the Vietnamese dialect to perform speech recognition on the voice signal "My Country", obtains the text information "My Country", and delivers the song matching "My Country" to the smart speaker for the smart speaker to play.
• With the voice recognition method provided by the embodiment of the present application, the user does not need to manually switch the ASR model and only needs to input the voice wake-up word in the corresponding dialect. The smart speaker can automatically recognize the dialect to which the voice wake-up word belongs and then request the server to select the ASR model corresponding to that dialect, supporting song-on-demand in multiple dialects while improving the efficiency of ordering songs.
  • FIG. 2 is a schematic flowchart diagram of a voice recognition method according to another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 1, mainly from the perspective of the terminal device. As shown in Figure 2, the method includes:
  • the voice signal to be identified is sent to the server, so that the server uses the ASR model corresponding to the first dialect to perform voice recognition on the voice signal to be recognized.
  • a voice wake-up word may be input to the terminal device, and the voice wake-up word is a voice signal specifying the text content, such as "on”, “Tmall Elf", “hello”, and the like.
• The terminal device receives the voice wake-up word input by the user and identifies the dialect to which the voice wake-up word belongs, thereby determining the dialect to which the subsequent to-be-recognized voice signal belongs (i.e., the dialect to which the voice wake-up word belongs), which provides the basis for performing speech recognition with the ASR model corresponding to that dialect. For convenience of description and distinction, the dialect to which the voice wake-up word belongs is recorded as the first dialect.
  • the terminal device sends a service request to the server, and the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. Then, the terminal device transmits the to-be-identified voice signal to the server. After receiving the service request, the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, and identifies the received voice signal to be recognized through the selected ASR model corresponding to the first dialect.
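The server-side selection step can be sketched as keeping one ASR model per dialect and binding the one named in the service request for the session; the class name, dialect names, and stand-in "models" below are hypothetical:

```python
class AsrServer:
    def __init__(self, asr_models):
        self.asr_models = asr_models  # {dialect: model}, one per dialect
        self.selected = None

    def handle_service_request(self, first_dialect):
        # Select the ASR model corresponding to the first dialect.
        self.selected = self.asr_models[first_dialect]

    def recognize(self, voice_signal):
        # A real model would decode audio; here it is a stand-in callable.
        return self.selected(voice_signal)

server = AsrServer({
    "cantonese": lambda signal: "text decoded with the Cantonese model",
    "mandarin":  lambda signal: "text decoded with the Mandarin model",
})
server.handle_service_request("cantonese")
text = server.recognize(b"...audio bytes...")
```

The point of the split is that the (cheap) dialect decision happens once on the wake-up word, and all subsequent to-be-recognized signals reuse the already-selected model.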
• In this embodiment, the terminal device identifies the first dialect to which the voice wake-up word belongs and sends a service request to the server, so that the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects and performs speech recognition on the subsequent to-be-recognized voice signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing the to-be-recognized speech.
• One manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up word with reference wake-up words recorded in different dialects, and using, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets the first setting requirement.
• Another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word with the acoustic features of different dialects respectively, and using, as the first dialect, the dialect whose acoustic-feature match with the voice wake-up word meets the second setting requirement.
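One hypothetical way to realize acoustic-feature matching is to describe each dialect by a representative feature vector (e.g. averaged MFCCs) and take an inverse Euclidean distance as the matching degree; all vectors and the "second setting requirement" value below are invented toy values:

```python
import math

DIALECT_FEATURES = {
    "cantonese": [0.9, 0.1, 0.4],
    "mandarin":  [0.2, 0.8, 0.5],
}

def matching_degree(features_a, features_b):
    # Larger value = more similar; 1.0 means identical vectors.
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))
    return 1.0 / (1.0 + distance)

def first_dialect_from_features(wake_word_features, second_setting=0.5):
    # Best-matching dialect, accepted only if it meets the setting requirement.
    best = max(DIALECT_FEATURES,
               key=lambda d: matching_degree(wake_word_features, DIALECT_FEATURES[d]))
    degree = matching_degree(wake_word_features, DIALECT_FEATURES[best])
    return best if degree >= second_setting else None

dialect = first_dialect_from_features([0.85, 0.15, 0.45])
```

Returning `None` when no dialect meets the requirement leaves room for a fallback (e.g. a default dialect or a re-prompt), which the patent text does not specify.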
• Still another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word with the reference text wake-up words corresponding to different dialects respectively, and using, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets the third setting requirement.
  • one manner of receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
• In an optional embodiment, before transmitting the to-be-recognized voice signal to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the to-be-recognized voice signal input by the user.
• In an optional embodiment, before outputting the voice input prompt information, the method further includes: receiving a notification message returned by the server, the notification message being used to indicate that the ASR model corresponding to the first dialect has been selected.
  • FIG. 3 is a schematic flowchart diagram of another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 1, mainly from the perspective of the server. As shown in FIG. 3, the method includes:
• From the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect is selected, where the first dialect is the dialect to which the voice wake-up word belongs.
• After identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server.
• The server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects and then performs speech recognition on the subsequent voice signals based on that model. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
  • the server needs to construct an ASR model corresponding to different dialects before selecting the ASR model corresponding to the first dialect.
• The process of constructing the ASR models corresponding to different dialects mainly includes: collecting corpora of different dialects; performing feature extraction on the corpora of different dialects to obtain acoustic features of the different dialects; and constructing the ASR models corresponding to the different dialects according to those acoustic features.
• After the speech recognition, the speech recognition result or information associated with the speech recognition result may be transmitted to the terminal device, so that the terminal device can perform subsequent processing based on the speech recognition result or the associated information.
  • FIG. 4 is a schematic structural diagram of another voice recognition system according to still another exemplary embodiment of the present application.
  • the speech recognition system 400 includes a server 401 and a terminal device 402. A communication connection is made between the server 401 and the terminal device 402.
  • the architecture of the speech recognition system 400 provided in this embodiment is the same as that of the speech recognition system 100 shown in FIG. 1, except that the functions of the server 401 and the terminal device 402 in the speech recognition process are different.
• For the implementation forms of the terminal device 402 and the server 401 in FIG. 4 and the manner of their communication connection, refer to the description of the embodiment shown in FIG. 1; details are not described herein again.
• The terminal device 402 and the server 401 cooperate with each other and can likewise provide the speech recognition function to the user.
• The terminal device 402 may be used by multiple users who speak different dialects. Therefore, in the voice recognition system 400, ASR models are constructed separately for different dialects, and based on the cooperation between the terminal device 402 and the server 401, the speech recognition function can be provided to users speaking different dialects; that is, speech recognition can be performed on the voice signals of users speaking different dialects.
• The terminal device 402 also supports the voice wake-up word function, but in this embodiment the terminal device 402 mainly receives the voice wake-up word input by the user and reports it to the server 401 for the server 401 to identify the dialect to which the voice wake-up word belongs.
• The server 401 provides the ASR models for different dialects and selects the corresponding ASR model to perform speech recognition on voice signals in the corresponding dialect; it also has the function of identifying the dialect to which the voice wake-up word belongs.
• When a user needs speech recognition, the voice wake-up word can be input to the terminal device 402; the voice wake-up word is a voice signal with specified text content, such as "open", "Tmall Elf", "hello", and so on.
  • the terminal device 402 receives the voice wake-up word input by the user, and transmits the voice wake-up word to the server 401.
• After receiving the voice wake-up word sent by the terminal device 402, the server 401 identifies the dialect to which the voice wake-up word belongs. For convenience of description and distinction, the dialect to which the voice wake-up word belongs is recorded as the first dialect.
• The first dialect refers to the dialect to which the voice wake-up word belongs and may be, for example, a Mandarin dialect, the Jin dialect, or the Xiang dialect.
  • the server 401 selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform voice recognition on the voice signals in the first dialect based on the ASR model corresponding to the first dialect.
  • the server 401 stores in advance an ASR model corresponding to different dialects.
• Each dialect may correspond to its own ASR model, or several similar dialects may correspond to the same ASR model; this is not limited herein.
  • the ASR model corresponding to the first dialect is used to convert the voice signal of the first dialect into text content.
• After transmitting the voice wake-up word to the server 401, the terminal device 402 continues to transmit the to-be-recognized voice signal to the server 401.
  • the server 401 receives the to-be-identified speech signal sent by the terminal device 402, and performs speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect.
• The to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device 402 after inputting the voice wake-up word; based on this, before transmitting the to-be-recognized voice signal to the server 401, the terminal device 402 may further receive the to-be-recognized voice signal input by the user.
• The to-be-recognized voice signal may also be a voice signal pre-recorded and stored locally on the terminal device 402.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
• The selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition; the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
• One manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up word with reference wake-up words recorded in different dialects, and using, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets the first setting requirement.
• Another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word with the acoustic features of different dialects respectively, and using, as the first dialect, the dialect whose acoustic-feature match with the voice wake-up word meets the second setting requirement.
• Still another manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word with the reference text wake-up words corresponding to different dialects respectively, and using, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets the third setting requirement.
  • the manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs is similar to the manner in which the terminal device 102 recognizes the first dialect to which the voice wake-up word belongs. For detailed description, refer to the foregoing embodiment, and details are not described herein again.
  • the manner in which the terminal device 402 receives the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; acquiring a voice wake-up word input by the user based on the voice input interface.
  • the terminal device 402 may output voice input prompt information to prompt the user to perform voice input; and thereafter, receive the voice signal to be recognized input by the user.
• Further, the terminal device 402 may receive a notification message returned by the server 401, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected. Based on this, after determining that the server 401 has selected the ASR model corresponding to the first dialect, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input; the to-be-recognized voice signal input by the user may then be sent to the server 401, which can directly recognize it with the selected ASR model.
• In addition, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 401 may collect corpora of different dialects, perform feature extraction on the corpora to obtain acoustic features of the different dialects, and construct the ASR models corresponding to the different dialects according to those acoustic features. For the detailed procedure of constructing the ASR model corresponding to each dialect, refer to the prior art; details are not described herein again.
• After the speech recognition, the server 401 may return the speech recognition result, or information associated with the speech recognition result, to the terminal device 402.
  • the server 401 may return the text content recognized by the voice to the terminal device 402; or the server 401 may return information such as songs, videos, and the like that match the voice recognition result to the terminal device 402.
  • the terminal device 402 receives the speech recognition result returned by the server 401 or the association information of the speech recognition result, and performs subsequent processing based on the speech recognition result or the association information of the speech recognition result.
  • FIG. 5 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in FIG. 4, mainly from the perspective of the terminal device. As shown in FIG. 5, the method includes:
  • a voice wake-up word may be input to the terminal device, and the voice wake-up word is a voice signal specifying a text content, such as "on”, “Tmall Elf", “hello”, and the like.
• The terminal device receives the voice wake-up word input by the user and sends the voice wake-up word to the server, so that the server identifies the dialect to which the voice wake-up word belongs and thereby determines the dialect to which the subsequent to-be-recognized voice signal belongs (i.e., the dialect to which the voice wake-up word belongs), providing a basis for speech recognition using the ASR model corresponding to that dialect.
  • the dialect to which the speech wake-up word belongs is recorded as the first dialect.
  • the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR model corresponding to the different dialects according to the first dialect to which the voice wake-up word belongs. Then, the terminal device continues to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
• The selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition; the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
  • the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
• In an optional embodiment, before transmitting the to-be-recognized voice signal to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the to-be-recognized voice signal input by the user.
• In an optional embodiment, before outputting the voice input prompt information, the method further includes: receiving a notification message returned by the server, the notification message being used to indicate that the ASR model corresponding to the first dialect has been selected.
  • FIG. 6 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 4, mainly from the perspective of the server. As shown in FIG. 6, the method includes:
• The server receives the voice wake-up word sent by the terminal device and identifies the dialect to which the voice wake-up word belongs, thereby determining the dialect to which the subsequent to-be-recognized voice signal belongs (i.e., the dialect to which the voice wake-up word belongs), which provides the basis for performing speech recognition with the ASR model corresponding to that dialect.
  • the dialect to which the speech wake-up word belongs is recorded as the first dialect.
• The server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects and then performs speech recognition on the subsequent voice signals based on that model, thereby automating multi-dialect speech recognition; the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
• In addition, because the voice wake-up word is short, the process of recognizing the dialect to which it belongs takes little time, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
• One manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up word with reference wake-up words recorded in different dialects, and using, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets the first setting requirement.
  • another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching acoustic features of the voice wake-up words with acoustic features of different dialects respectively, and acquiring acoustics with the voice wake-up words
  • the dialect of the feature matching the second setting requirement is used as the first dialect.
  • the foregoing manner of recognizing the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, and the text wake-up word respectively corresponding to the dialect word corresponding to different dialects The matching is performed to obtain a dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement as the first dialect.
  • the method before selecting the ASR model corresponding to the first dialect in the ASR model corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on corpora of different dialects to obtain Acoustic characteristics of different dialects; according to the acoustic characteristics of different dialects, construct ASR models corresponding to different dialects.
  • the server may return the speech recognition result or the association information of the speech recognition result to the terminal device.
  • the server may return the text content recognized by the voice to the terminal device; or, the song, the video, and the like that match the voice recognition result may be returned to the terminal device.
  • the speech recognition of the multi-dial is performed by the terminal device and the server, but is not limited thereto.
  • the processing function and the storage function of the terminal device or the server are sufficiently powerful, the multi-word speech recognition function can be separately integrated on the terminal device or the server.
  • still another exemplary embodiment of the present application provides a voice recognition method independently implemented by a server or a terminal device.
  • the server and the terminal device are collectively referred to as an electronic device.
  • the voice recognition method independently implemented by the server or the terminal device includes the following steps:
  • a voice wake-up word may be input to the electronic device, and the voice wake-up word is a voice signal specifying the text content, such as "on”, “Tmall Elf", “hello”, and the like.
  • the electronic device receives the voice wake-up word sent by the user, and identifies the first dialect to which the voice wake-up word belongs.
  • the first dialect refers to the dialect to which the awakening words of speech belong, such as Mandarin dialect, Jin dialect, Xiang dialect and so on.
  • the electronic device selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform voice recognition on the subsequent to-be-identified voice signals based on the ASR model corresponding to the first dialect.
  • the electronic device stores in advance an ASR model corresponding to different dialects.
  • an ASR model corresponding to a dialect, or several similar dialects may also correspond to the same ASR model, which is not limited thereto.
  • the ASR model corresponding to the first dialect is used to convert the voice signal of the first dialect into text content.
  • the electronic device uses the ASR model corresponding to the first dialect to perform speech recognition on the speech signal to be recognized.
  • the to-be-identified voice signal may be a voice signal that the user continues to input to the electronic device after inputting the voice wake-up word, based on which the electronic device performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect. It is also possible to receive a speech signal to be recognized input by the user.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally in the electronic device, based on which the electronic device may directly obtain the voice signal to be recognized from the local.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent speech signals to be recognized, and the multi-dial speech recognition is automated, and the ASR model of the corresponding dialect is automatically selected based on the speech wake-up words, which is more convenient and quick to implement without manual operation by the user. Conducive to improving the efficiency of multi-language speech recognition.
  • the process of recognizing the dialect to which the speech wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the speech wake-up word belongs, and select the ASR model corresponding to the first dialect. To further improve the efficiency of multi-language speech recognition.
  • one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the voice wake-up words with the reference wake-up words recorded in different dialects, and acquiring the sound wake-up words.
  • the dialect corresponding to the reference wake-up word of the first setting requirement is used as the first dialect.
  • another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching acoustic features of the voice wake-up words with acoustic features of different dialects respectively, and acquiring acoustics with the voice wake-up words
  • the dialect of the feature matching the second setting requirement is used as the first dialect.
  • the foregoing manner of recognizing the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, and the text wake-up word respectively corresponding to the dialect word corresponding to different dialects The matching is performed to obtain a dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement as the first dialect.
  • the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
  • the method before performing speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect, the method further includes: outputting the voice input prompt information to prompt the user to perform voice input; and receiving the user input to be recognized. voice signal.
  • the method before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on corpora of different dialects to obtain different The acoustic characteristics of dialects; according to the acoustic characteristics of different dialects, construct ASR models corresponding to different dialects.
  • the electronic device may perform subsequent processing based on the speech recognition result or the association information of the speech recognition result.
  • the voice wake-up word may be preset; or, the user may be allowed to customize the wake-up word.
  • the custom wake-up word or the preset wake-up word mainly refers to the content and/or tone of the wake-up word.
  • the function of the custom voice wake-up word can be implemented by the terminal device or by the server.
  • the function of the custom speech wake-up word may be provided by a device that recognizes the dialect to which the speech wake-up word belongs.
  • the terminal device can provide the user with an entry for a custom wake-up word.
  • the portal can be implemented as a physical button, based on which the user can click on the physical button to trigger a wake-up word customization operation.
  • the entry may be a wake-up word customization sub-item in a setting option of the terminal device, based on which the user may enter a setting option of the terminal device, and then click, hover or long-press for the wake-up word customization sub-item, etc. Operation, which triggers a wake-up word custom action.
  • the terminal device can receive the customized voice signal input by the user in response to the wake-up word custom operation, and save the received custom voice signal as a voice wake-up. word.
  • the terminal device can display an audio entry page to the user to record a customized voice signal sent by the user. For example, after the user triggers the wake-up word customization operation, the terminal device displays the audio input page to the user. At this time, the user can input the voice signal “hello”, and the terminal device will receive the voice signal after receiving the voice signal “hello”. "Hello” is set to the voice wake up word.
  • the terminal device may maintain a wake-up vocabulary and save the user-defined voice wake-up words to the wake-up vocabulary.
  • the voice wake-up word should not be too long to reduce the difficulty in identifying the dialect, but it should not be too short.
  • the speech wake-up word is too short, the recognition is not high, and it is easy to cause false wake-up.
  • the voice wake-up word can be between 3 and 5 characters, but is not limited thereto.
  • the one character here refers to one Chinese character or one English letter.
  • the voice wake-up word is mainly used to wake up or activate the voice recognition function of the application, and may not define the dialect to which the voice wake-up word belongs, that is, the user may use any dialect or Mandarin to issue the voice wake-up word.
  • the user may re-issue a voice signal having a dialect indicating meaning, for example, the voice signal may be a voice signal whose contents are "Tianjin dialect", “Henan dialect”, “enable Minnan dialect", and the like.
  • the dialect that needs speech recognition can be parsed from the voice signal with the dialect indicating meaning sent by the user, and then the ASR model corresponding to the parsed dialect is selected from the ASR models corresponding to different dialects, and based on the selected The ASR model performs speech recognition on subsequent speech signals to be recognized.
  • a speech signal having a dialect indicating meaning herein is referred to as a first speech signal
  • a dialect parsed from the first speech signal is referred to as a first dialect.
  • the voice signal having the dialect guiding meaning can be used as the first voice signal in the embodiment of the present application.
  • the first speech signal may be a speech signal emitted by the user in the first dialect such that the first dialect may be identified based on the acoustic characteristics of the first speech signal.
  • the first voice signal may be a voice signal containing the name of the first dialect, for example, in the voice signal "Please enable the Minnan dialect model", the "Minnan dialect" is the name of the first dialect. Based on this, the phoneme segment corresponding to the name of the first dialect can be extracted from the first voice signal, thereby identifying the first dialect.
  • the above-mentioned voice recognition method combining the voice wake-up word and the first voice signal may be implemented by the terminal device and the server, or may be implemented independently by the terminal device or the server. The following will explain the different implementations separately:
  • Mode A The above-mentioned voice recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server.
  • the terminal device supports a voice wake-up function.
  • the voice wake-up word can be input to the terminal device to wake up the voice recognition function.
  • the terminal device receives the voice wake-up word to wake up the voice recognition function.
  • the user inputs a first voice signal having a dialect guiding meaning to the terminal device; after receiving the first voice signal input by the user, the terminal device parses the first dialect that needs voice recognition from the first voice signal, that is, the subsequent to be recognized The dialect to which the speech signal belongs, thereby providing a basis for speech recognition using the corresponding ASR model of the dialect.
  • the terminal device After parsing the first dialect from the first voice signal, the terminal device sends a service request to the server, where the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the server After receiving the service request sent by the terminal device, the server selects the ASR model corresponding to the first dialect from the ASR model corresponding to the different dialects according to the indication of the service request, so as to perform the subsequent to-be-identified voice signal based on the ASR model corresponding to the first dialect. Speech Recognition.
  • the terminal device After transmitting the service request to the server, the terminal device continues to send the to-be-identified voice signal to the server, where the to-be-identified voice signal belongs to the first dialect.
  • the server receives the to-be-identified voice signal sent by the terminal device, and performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the selected first dialect.
  • the matching ASR model for speech recognition is beneficial to improve the accuracy of speech recognition.
  • the to-be-identified voice signal may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal, and based on this, the terminal device may further receive the user input before sending the to-be-identified voice signal to the server. Identify the voice signal.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally in the terminal device.
  • the speech wake-up word is primarily used to wake up the speech recognition function of the terminal device; and the first dialect that subsequently requires speech recognition may be provided by the first speech signal. Based on this, it is possible to not limit the language used by the user to issue a voice wake-up word.
  • the user can issue a speech wake-up word using Mandarin, or can also use a first dialect to issue a speech wake-up word, or can also use a different dialect than the first dialect to issue a speech wake-up word.
  • the terminal device may preferentially parse the first dialect from the first voice signal; if the first dialect cannot be parsed from the first voice signal, the voice may be recognized.
  • the dialect to which the awakening word belongs is used as the first dialect.
  • the implementation manner of specifically identifying the dialect to which the voice wake-up word belongs is the same as the embodiment of the dialect in which the voice wake-up word is recognized in the above embodiment, and details are not described herein again.
  • Mode B The above-mentioned voice recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server.
  • the terminal device is mainly configured to receive the voice wake-up word and the first voice signal input by the user and report the signal to the server, so that the server parses the first dialect from the first voice signal, which is different from the mode A.
  • Terminal Equipment the server provides the ASR model for different dialects and selects the corresponding ASR model for speech recognition of the speech signal in the corresponding dialect. It also has the function of parsing the first dialect from the first speech signal.
  • a voice wake-up word can be input to the terminal device.
  • the terminal device receives the voice wake-up word input by the user, and sends the voice wake-up word to the server.
  • the server wakes up its own speech recognition function based on the voice wake up words.
  • the user may continue to send the first voice signal to the terminal device.
  • the terminal device transmits the received first voice signal to the server.
  • the server parses the first dialect from the first voice signal, and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to facilitate the subsequent voice of the first dialect based on the ASR model corresponding to the first dialect.
  • the signal is speech recognized.
  • the terminal device After transmitting the first voice signal to the server, the terminal device continues to send the to-be-identified voice signal to the server.
  • the server uses the ASR model corresponding to the first dialect to perform speech recognition on the recognized speech.
  • the to-be-identified voice may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal, and the terminal device may further receive the user input to be recognized before sending the to-be-identified voice signal to the server. voice signal.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally in the terminal device.
  • the method before the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect is not parsed from the first voice signal, identifying the voice wake-up words The dialect to which it belongs is the first dialect.
  • the server when parsing the first dialect that requires speech recognition from the first speech signal, includes: converting the first speech signal to the first phoneme sequence based on the acoustic model; storing the memory in the memory The phoneme segments corresponding to the different dialect names are respectively matched in the first phoneme sequence; when the middle phoneme segment is matched in the first phoneme sequence, the dialect corresponding to the phoneme segment in the matching is used as the first dialect.
  • Mode C The above voice recognition method combining the voice wake-up word and the first voice signal is separately implemented by the terminal device or the server.
  • a voice wake-up word can be input to the terminal device or the server.
  • the terminal device or the server wakes up the voice recognition function according to the voice wake-up word input by the user.
  • the user may continue to input the first voice signal having the dialect guiding meaning to the terminal device or the server.
  • the terminal device or the server parses the first dialect from the first voice signal, and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the terminal device or the server uses the ASR model corresponding to the first dialect to perform speech recognition on the recognized speech.
  • the to-be-identified voice may be a voice signal that the user continues to input to the terminal device or the server after inputting the first voice signal, and the terminal device or the server performs the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the voice signal to be recognized input by the user may also be received.
  • the to-be-identified voice signal may also be a voice signal pre-recorded and stored locally at the terminal device or the server, based on which the terminal device or the server may directly obtain the voice signal to be recognized from the local.
  • the method before the terminal device or the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect is not parsed from the first voice signal, identifying The dialect to which the voice wake-up word belongs is used as the first dialect.
  • the terminal device or the server when parsing the first dialect that needs to perform speech recognition from the first speech signal, includes: converting the first speech signal into the first phoneme sequence based on the acoustic model; The phoneme segments corresponding to the different dialect names stored in the first phoneme sequence are matched in the first phoneme sequence; when the middle phoneme segment is matched in the first phoneme sequence, the dialect corresponding to the phoneme segment in the matching is used as the first dialect.
  • parsing the first dialect that needs to perform speech recognition from the first voice signal including: converting the first voice signal into the first phoneme sequence based on the acoustic model; The phoneme segments corresponding to the different dialect names are respectively matched in the first phoneme sequence; when the middle phoneme segment is matched in the first phoneme sequence, the dialect corresponding to the phoneme segment in the matching is used as the first dialect.
  • preprocessing and feature extraction of the first speech signal are required.
  • the preprocessing process includes pre-emphasis, windowing framing, and endpoint detection.
  • the feature extraction is to extract the acoustic features such as time domain features or frequency domain features of the preprocessed first speech signal.
  • the acoustic model can convert the acoustic characteristics of the first speech signal into a phoneme sequence.
  • Phonemes are the basic elements that make up the pronunciation of a word or the pronunciation of a Chinese character. Among them, the phonemes constituting the pronunciation of a word may be 39 phonemes invented by Carnegie Mellon University; the phonemes constituting the pronunciation of Chinese characters may be all initials and finals.
  • Acoustic models include, but are not limited to, neural network based deep learning models, hidden Markov models, and the like. The manner of converting the acoustic features into the phoneme sequences belongs to the prior art, and details are not described herein again.
  • the terminal device or the server After converting the first voice signal into the first phoneme sequence, the terminal device or the server respectively matches the phoneme segments corresponding to the different dialect names in the first phoneme sequence.
  • phoneme fragments of different dialect names may be pre-stored, for example, a phoneme fragment of the dialect name "Henan dialect", a phoneme fragment of the dialect name "Minnan”, a dialect name "British English", and the like. If the dialect name is a word, the phoneme fragment is a segment composed of several phonemes obtained from the 39 phonemes invented by Carnegie Mellon University. If the dialect name is a Chinese character, the phoneme fragment is a fragment composed of the initials and finals of the dialect name.
  • the phoneme segments corresponding to the different dialect names stored in advance are compared to determine whether the first phoneme sequence contains a phoneme segment identical or similar to the phoneme segment of a certain dialect name.
  • a similarity between each phoneme segment in the first phoneme sequence and a phoneme segment in a different dialect name may be calculated; and from a phoneme segment of a different dialect name, selecting a similarity with a phoneme segment in the first phoneme sequence satisfies
  • the phoneme fragment required by the preset similarity is used as the audio segment in the matching.
  • the dialect corresponding to the phoneme segment in the match is used as the first dialect.
  • FIG. 8 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 800 includes a receiving module 801, an identifying module 802, a first transmitting module 803, and a second transmitting module 804.
  • the receiving module 801 is configured to receive a voice wake-up word.
  • the identification module 802 is configured to identify a first dialect to which the voice wake-up word received by the receiving module 801 belongs.
  • the first sending module 803 is configured to send a service request to the server, to request the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the second sending module 804 is configured to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the method is specifically configured to: dynamically match the voice wake-up words with the reference wake-up words recorded in different dialects, and acquire and The dialect of the speech wake-up word matches the dialect corresponding to the reference wake-up word of the first setting requirement as the first dialect; or the acoustic features of the speech-awakening word are respectively matched with the acoustic features of different dialects, and the acoustic characteristics of the awakened word are obtained.
  • the dialect that matches the second setting requirement is used as the first dialect; or the voice wake-up word is converted into the text wake-up word, and the text wake-up word is matched with the reference text wake-up word corresponding to different dialects respectively, and the text wake-up word is obtained.
  • the dialect corresponding to the reference text wake-up word whose matching degree meets the third setting requirement is used as the first dialect.
  • the receiving module 801 when receiving the voice wake-up word, is specifically configured to: display a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire a voice wake-up word input by the user based on the voice input interface. .
  • the second sending module 804 before sending the to-be-identified voice signal to the server, is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
  • the second sending module 804 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the receiving module 801 before receiving the voice wake-up word, is further configured to: receive a custom voice signal input by the user in response to the wake-up word customization operation; and save the customized voice signal as a voice wake-up word.
  • the internal function and structure of the speech recognition apparatus 800 are described above. As shown in FIG. 9, in actuality, the speech recognition apparatus 800 can be implemented as a terminal apparatus, including: a memory 901, a processor 902, and a communication component 903.
  • the memory 901 is configured to store a computer program and can be stored to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 901 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable.
  • SRAM static random access memory
  • EEPROM electrically erasable programmable read only memory
  • EPROM Programmable Read Only Memory
  • PROM Programmable Read Only Memory
  • ROM Read Only Memory
  • Magnetic Memory Flash Memory
  • Disk Disk or Optical Disk.
  • the processor 902 is coupled to the memory 901 for executing a computer program in the memory 901 for: receiving a voice wake-up word through the communication component 903; identifying a first dialect to which the voice wake-up word belongs; and transmitting a service to the server through the communication component 903
  • the request is to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the to-be-identified voice signal to the server through the communication component 903, so that the server uses the ASR model corresponding to the first dialect to perform the recognized speech signal.
  • Speech Recognition is to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the communication component 903 is configured to receive the voice wake-up word, and send the service request and the to-be-identified voice signal to the server.
  • the processor 902 when the processor 902 identifies the first dialect to which the voice wake-up word belongs, the processor 902 is specifically configured to:
  • the voice wake-up words are dynamically matched with the reference wake-up words recorded in different dialects, and the dialect corresponding to the reference wake-up words whose matching degree with the voice wake-up words meets the first setting requirement is obtained as the first dialect; or the voice is spoken
  • the acoustic features of the wake-up words are respectively matched with the acoustic features of different dialects, and the dialects that match the acoustic characteristics of the speech wake-up words according to the second setting requirement are obtained as the first dialect; or the speech wake-up words are converted into the text wake-up words.
  • the text wake-up words are respectively matched with the reference text wake-up words corresponding to different dialects, and the dialect corresponding to the reference text wake-up words whose matching degree with the text wake-up words meets the third setting requirement is obtained as the first dialect.
  • the terminal device further includes: a display screen 904.
  • the processor 902 when receiving the voice wake-up word, is specifically configured to: according to an instruction to activate or turn on the terminal device, display a voice input interface to the user through the display screen 904; and acquire a voice wake-up word input by the user based on the voice input interface. .
  • the terminal device further includes: an audio component 906.
  • the processor 902 is further configured to: output the voice input prompt information through the audio component 906 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user through the audio component 906.
  • the audio component 906 is further configured to output voice input prompt information and receive a voice signal to be recognized input by the user.
  • the processor 902 before outputting the voice input prompt information, is further configured to: receive, by the communication component 903, a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the processor 902, before receiving the voice wake-up word, is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user through the communication component 903; and save the customized voice signal as the voice wake-up word.
  • the terminal device further includes: a power component 905 and other components.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the terminal device in the foregoing method embodiment can be implemented.
  • FIG. 10 is a schematic structural diagram of a module of another voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1000 includes a first receiving module 1001, a selecting module 1002, a second receiving module 1003, and an identifying module 1004.
  • the first receiving module 1001 is configured to receive a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is selected.
  • the selecting module 1002 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, where the first dialect is a dialect to which the voice wake-up word belongs.
  • the second receiving module 1003 is configured to receive a to-be-identified voice signal sent by the terminal device.
  • the identification module 1004 is configured to perform voice recognition on the to-be-identified voice signal received by the second receiving module 1003 by using the ASR model corresponding to the first dialect.
  • the voice recognition apparatus 1000 further includes a building module, configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
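The building module's pipeline (collect corpora, extract acoustic features, construct a per-dialect ASR model) can be sketched minimally as follows; the toy feature extractor is a stand-in for a real acoustic front end (e.g. MFCC extraction), and the "model" is simply a summary of the extracted features:

```python
def extract_features(utterance):
    # Stand-in for real acoustic feature extraction: a single scalar
    # statistic per utterance, used only to illustrate the pipeline.
    codes = [ord(c) for c in utterance]
    return sum(codes) / (len(codes) or 1)

def build_asr_models(corpora):
    """corpora: dict mapping dialect name -> list of utterance strings.
    Returns a dict mapping dialect name -> per-dialect 'model' built
    from the acoustic features of that dialect's corpus."""
    models = {}
    for dialect, utterances in corpora.items():
        features = [extract_features(u) for u in utterances]
        models[dialect] = {
            "dialect": dialect,
            "mean_feature": sum(features) / len(features),
        }
    return models
```

A production system would replace both functions with genuine acoustic modeling, but the control flow — corpus in, features out, one model per dialect — matches the steps listed above.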
  • the speech recognition apparatus 1000 can be implemented as a server, including: a memory 1101, a processor 1102, and a communication component 1103.
  • the memory 1101 is configured to store a computer program and may also store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1101 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1102 is coupled to the memory 1101 and is configured to execute a computer program in the memory 1101 in order to: receive, through the communication component 1103, a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is to be selected; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; receive, through the communication component 1103, the to-be-recognized voice signal sent by the terminal device; and perform speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1103 is configured to receive the service request and the to-be-identified voice signal.
  • the processor 1102 is configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the server further includes an audio component 1106.
  • the processor 1102 is further configured to: receive, by the audio component 1106, the to-be-identified voice signal sent by the terminal device.
  • the server further includes a display 1104, a power component 1105, and the like.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
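The server-side flow just described — receive a service request that selects the first dialect's ASR model, then recognize subsequent voice signals with that model — can be sketched as follows; the request shape and the representation of a model as a plain callable are illustrative assumptions:

```python
class DialectASRServer:
    """Minimal sketch of the server-side flow: a service request selects
    the ASR model of the first dialect; subsequent to-be-recognized voice
    signals are recognized with the selected model."""

    def __init__(self, asr_models):
        # asr_models: dialect name -> recognizer (here, a plain callable)
        self.asr_models = asr_models
        self.selected = None

    def handle_service_request(self, request):
        # The service request indicates that the ASR model corresponding
        # to the first dialect is to be selected.
        dialect = request["dialect"]
        self.selected = self.asr_models[dialect]
        # Notification message telling the terminal the model is ready.
        return {"status": "model_selected", "dialect": dialect}

    def handle_voice_signal(self, signal):
        if self.selected is None:
            raise RuntimeError("no ASR model selected yet")
        return self.selected(signal)
```

This mirrors the first receiving module, selecting module, second receiving module, and identification module of apparatus 1000, but makes no claim about the actual wire protocol.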
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model performs speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing multi-dialect speech.
  • FIG. 12 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1200 includes a receiving module 1201, a first transmitting module 1202, and a second transmitting module 1203.
  • the receiving module 1201 is configured to receive a voice wake-up word.
  • the first sending module 1202 is configured to send the voice wake-up word received by the receiving module 1201 to the server, so that, based on the voice wake-up word, the server selects the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects.
  • the second sending module 1203 is configured to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
  • the receiving module 1201 when receiving the voice wake-up word, is specifically configured to: display a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire a voice wake-up word input by the user based on the voice input interface.
  • the second sending module 1203 before sending the to-be-identified voice signal to the server, is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
  • the second sending module 1203 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the receiving module 1201 before receiving the voice wake-up word, is further configured to: receive a customized voice signal input by the user in response to the wake-up word customization operation.
  • the first sending module 1202 is further configured to upload a customized voice signal to the server.
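A minimal sketch of the terminal-side interaction described by modules 1201–1203, with a hypothetical server interface (`select_model_for`, `recognize`) standing in for the real communication protocol:

```python
class TerminalDevice:
    """Sketch of the terminal-side modules above. `server` is a
    hypothetical stand-in providing select_model_for(wake_word) -> dict
    and recognize(signal) -> text."""

    def __init__(self, server):
        self.server = server

    def on_wake_word(self, wake_word_audio):
        # First sending module: forward the voice wake-up word so the
        # server can select the first dialect's ASR model, and wait for
        # the notification message confirming the selection.
        notification = self.server.select_model_for(wake_word_audio)
        return notification.get("status") == "model_selected"

    def recognize(self, wake_word_audio, voice_signal):
        # Second sending module: send the to-be-recognized voice signal
        # only after the server confirms the model is selected.
        if self.on_wake_word(wake_word_audio):
            return self.server.recognize(voice_signal)
        return None
```

In the embodiment, the notification message is what gates prompting the user for voice input; the sketch encodes that ordering but nothing about transport or audio capture.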
  • the speech recognition apparatus 1200 can be implemented as a terminal device, including: a memory 1301, a processor 1302, and a communication component 1303.
  • the memory 1301 is configured to store a computer program and may also store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1301 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1302 is coupled to the memory 1301 and is configured to execute a computer program in the memory 1301 in order to: receive a voice wake-up word through the communication component 1303; send the voice wake-up word to the server through the communication component 1303, so that, based on the voice wake-up word, the server selects the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server through the communication component 1303, so that the server performs speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1303 is configured to receive the voice wake-up word, and send the voice wake-up word and the to-be-identified voice signal to the server.
  • the terminal device further includes a display screen 1304.
  • the processor 1302, when receiving the voice wake-up word, is specifically configured to: in response to an instruction to activate or turn on the terminal device, display a voice input interface to the user through the display screen 1304; and acquire the voice wake-up word input by the user based on the voice input interface.
  • the terminal device further includes an audio component 1306.
  • the processor 1302 is configured to receive the speech wake-up words through the audio component 1306.
  • the processor 1302 is further configured to: output the voice input prompt information through the audio component 1306 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
  • the processor 1302 before outputting the voice input prompt information, is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  • the processor 1302 before receiving the voice wake-up word, is further configured to: in response to the wake-up word custom operation, receive the customized voice signal input by the user through the communication component 1303, and upload the customized voice signal. To the server.
  • the terminal device further includes: a power component 1305 and other components.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the terminal device in the foregoing method embodiment can be implemented.
  • FIG. 14 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1400 includes a first receiving module 1401, a first identifying module 1402, a selecting module 1403, a second receiving module 1404, and a second identifying module 1405.
  • the first receiving module 1401 is configured to receive a voice wake-up word sent by the terminal device.
  • the first identification module 1402 is configured to identify a first dialect to which the voice wake-up word belongs.
  • the selecting module 1403 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the second receiving module 1404 is configured to receive a to-be-identified voice signal sent by the terminal device.
  • the second identification module 1405 is configured to perform voice recognition on the to-be-identified voice signal received by the second receiving module 1404 by using the ASR model corresponding to the first dialect.
  • the first identification module 1402 is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the speech recognition apparatus 1400 further includes a building module, configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the speech recognition apparatus 1400 can be implemented as a server including: a memory 1501, a processor 1502, and a communication component 1503.
  • the memory 1501 is configured to store a computer program and may also store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1501 can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1502 is coupled to the memory 1501 and is configured to execute a computer program in the memory 1501 in order to: receive, through the communication component 1503, a voice wake-up word sent by the terminal device; identify the first dialect to which the voice wake-up word belongs; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; receive, through the communication component 1503, the to-be-recognized voice signal sent by the terminal device; and perform speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1503 is configured to receive a voice wake-up word and a voice signal to be recognized.
  • the processor 1502, when identifying the first dialect to which the voice wake-up word belongs, is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the processor 1502 is configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the server further includes an audio component 1506.
  • the processor 1502 is configured to: receive, by the audio component 1506, a voice wake-up word sent by the terminal device, and receive, by the audio component 1506, the voice signal to be recognized sent by the terminal device.
  • the server further includes: a display 1504, a power component 1505, and the like.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
  • FIG. 16 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application.
  • the voice recognition apparatus 1600 includes a receiving module 1601, a first identifying module 1602, a selecting module 1603, and a second identifying module 1604.
  • the receiving module 1601 is configured to receive a voice wake-up word.
  • the first identification module 1602 is configured to identify a first dialect to which the voice wake-up word belongs.
  • the selecting module 1603 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
  • the second identification module 1604 is configured to perform speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect.
  • the first identification module 1602 is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the receiving module 1601 when receiving the voice wake-up word sent by the terminal device, is specifically configured to: display a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire user input based on the voice input interface. Voice wake up words.
  • the second identification module 1604 is further configured to: before performing speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect, output voice input prompt information to prompt the user to perform voice input, and receive the to-be-recognized voice signal input by the user.
  • the voice recognition apparatus 1600 further includes a building module, configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the receiving module 1601, before receiving the voice wake-up word, is further configured to: receive a customized voice signal input by the user in response to a wake-up word customization operation; and save the customized voice signal as the voice wake-up word.
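The wake-up word customization flow (save a user-recorded signal, then use it later to detect the wake-up word) might be sketched as follows; the exact-equality comparison is a placeholder for real acoustic matching:

```python
class WakeWordStore:
    """Sketch of wake-up word customization: the user's recorded signal
    is saved as the voice wake-up word and later used for detection."""

    def __init__(self):
        self.custom_wake_word = None

    def on_customize(self, recorded_signal):
        # In response to the wake-up word customization operation,
        # save the customized voice signal as the voice wake-up word.
        self.custom_wake_word = recorded_signal

    def is_wake_word(self, signal):
        # Placeholder comparison; a real device would score acoustic
        # similarity rather than require byte-for-byte equality.
        return (self.custom_wake_word is not None
                and signal == self.custom_wake_word)
```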
  • the speech recognition apparatus 1600 can be implemented as an electronic device including: a memory 1701, a processor 1702, and a communication component 1703.
  • the electronic device can be a terminal device or a server.
  • the memory 1701 is configured to store a computer program and may also store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory 1701 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the processor 1702 is coupled to the memory 1701 and is configured to execute a computer program in the memory 1701 in order to: receive a voice wake-up word through the communication component 1703; identify the first dialect to which the voice wake-up word belongs; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and perform speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component 1703 is configured to receive a voice wake-up word.
  • the processor 1702, when identifying the first dialect to which the voice wake-up word belongs, is specifically configured to: match the voice wake-up word against reference wake-up words recorded in different dialects, and take the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first setting requirement as the first dialect; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take the dialect whose acoustic features match those of the voice wake-up word to a second setting requirement as the first dialect; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third setting requirement as the first dialect.
  • the electronic device further includes: a display screen 1704.
  • the processor 1702, when receiving the voice wake-up word sent by the terminal device, is specifically configured to: in response to an instruction to activate or turn on the terminal device, display a voice input interface to the user through the display screen 1704; and acquire the voice wake-up word input by the user based on the voice input interface.
  • the electronic device further includes an audio component 1706.
  • the processor 1702 is further configured to: output voice input prompt information through the audio component 1706 to prompt the user to perform voice input, and receive the to-be-recognized voice signal input by the user. The processor 1702 is also configured to receive the voice wake-up word through the audio component 1706.
  • the processor 1702 is configured to: collect corpora of different dialects before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to those acoustic features.
  • the processor 1702, before receiving the voice wake-up word, is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user through the communication component 1703; and save the customized voice signal as the voice wake-up word.
  • the electronic device further includes: a power component 1705 and other components.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the electronic device in the foregoing method embodiment can be implemented.
  • the ASR model is constructed for different dialects.
  • the dialect to which the speech wake-up word belongs is recognized in advance, and then the ASR model corresponding to the dialect to which the speech wake-up word belongs is selected from the ASR models corresponding to different dialects.
  • the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signals, which automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and quick and helps improve the efficiency of multi-dialect speech recognition.
  • the process of recognizing the dialect to which the voice wake-up word belongs is relatively short, so that the speech recognition system can quickly recognize the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of recognizing multi-dialect speech.
  • the embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component.
  • a memory, configured to store a computer program and various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), and erasable programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Disk or Optical Disk.
  • a processor, coupled to the memory and the communication component, configured to execute a computer program in the memory in order to: receive a voice wake-up word through the communication component to wake up the voice recognition function; receive, through the communication component, a dialect-indicating first voice signal input by the user; parse, from the first voice signal, the first dialect in which speech recognition is required; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; send a service request to the server through the communication component to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server through the communication component, so that the server performs speech recognition on the to-be-recognized voice signal by using the ASR model corresponding to the first dialect.
  • the communication component is configured to receive a voice wake-up word and the first voice signal, and send a service request and a voice signal to be recognized to the server.
  • the processor, before sending the service request to the server, is further configured to: if the first dialect is not parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
  • the memory is further configured to store phoneme segments corresponding to different dialect names.
  • the processor, when parsing the first dialect that requires speech recognition from the first speech signal, is specifically configured to: convert the first speech signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
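The phoneme-matching step described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the acoustic model is omitted (the sketch starts from an already-converted phoneme sequence), and the dialect-name phoneme segments are toy values rather than a real phoneme inventory.

```python
# Sketch of dialect detection by phoneme-segment matching.
# The dialect-name phoneme segments below are toy values; a real system
# would use the phoneme inventory of its acoustic model.

DIALECT_PHONEME_SEGMENTS = {
    "Cantonese": ["g", "w", "o", "ng", "d", "u", "ng", "w", "a"],
    "Sichuanese": ["s", "i", "ch", "u", "an", "h", "u", "a"],
}

def match_dialect(first_phoneme_sequence):
    """Return the dialect whose name's phoneme segment occurs as a
    contiguous sub-sequence of the phoneme sequence, or None."""
    for dialect, segment in DIALECT_PHONEME_SEGMENTS.items():
        m = len(segment)
        for start in range(len(first_phoneme_sequence) - m + 1):
            if first_phoneme_sequence[start:start + m] == segment:
                return dialect
    return None

# A phoneme sequence that contains the segment for "Cantonese".
seq = ["n", "i", "g", "w", "o", "ng", "d", "u", "ng", "w", "a", "l", "a"]
print(match_dialect(seq))  # → Cantonese
```

If no segment is matched, the function returns None, which corresponds to the fallback case above where the dialect of the voice wake-up word is used as the first dialect.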
  • the embodiment of the present application further provides a server, including: a memory, a processor, and a communication component.
  • the memory can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • a processor coupled to the memory and the communication component, for executing a computer program in the memory, for: receiving, through the communication component, a voice wake-up word sent by the terminal device to wake up the voice recognition function; receiving, through the communication component, a first voice signal having dialect-indicating meaning sent by the terminal device; parsing, from the first voice signal, a first dialect that requires voice recognition; selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and receiving, through the communication component, the voice signal to be recognized sent by the terminal device, and performing voice recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
  • a communication component configured to receive a voice wake-up word, a first voice signal, and a voice signal to be recognized.
  • the processor, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, is further configured to: if the first dialect is not parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
  • the memory is further configured to store phoneme segments corresponding to different dialect names.
  • the processor, when parsing the first dialect that requires speech recognition from the first speech signal, is specifically configured to: convert the first speech signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
  • the embodiment of the present application further provides an electronic device, which may be a terminal device or a server.
  • the electronic device includes a memory, a processor, and a communication component.
  • a memory for storing a computer program and for storing various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • the memory can be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • a processor coupled to the memory and the communication component, for executing a computer program in the memory, for: receiving a voice wake-up word through the communication component to wake up the voice recognition function; receiving, through the communication component, a user-entered first voice signal having dialect-indicating meaning; parsing, from the first voice signal, a first dialect that requires speech recognition; selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
  • a communication component configured to receive the voice wake-up word and the first voice signal.
  • the processor, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, is further configured to: if the first dialect is not parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
  • the memory is further configured to store phoneme segments corresponding to different dialect names.
  • the processor, when parsing the first dialect that requires speech recognition from the first speech signal, is specifically configured to: convert the first speech signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
  • the communication components in Figures 9, 11, 13, 15, and 17 above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices.
  • the device in which the communication component is located can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel.
  • the communication component also includes a near field communication (NFC) module to facilitate short range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the display screens of FIGS. 9, 11, 13, 15, and 17 described above include a liquid crystal display (LCD) and a touch panel (TP). If the display includes a touch panel, the display can be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
  • the power components in Figures 9, 11, 13, 15, and 17 above provide power to the various components of the device in which the power components are located.
  • the power components can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
  • the audio component includes a microphone (MIC) that is configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode.
  • the received audio signal can be further stored in a memory or transmitted via a communication component.
  • the audio component further includes a speaker for outputting an audio signal.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • these computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes permanent and non-permanent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media does not include transitory computer readable media, such as modulated data signals and carrier waves.

Abstract

A speech recognition method, comprising: receiving a speech wake-up word (21); recognizing a first dialect to which the speech wake-up word belongs (22); sending to a server a service request to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects (23); and sending to the server a speech signal to be recognized to enable the server to perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect (24). The method can automatically perform speech recognition on multiple dialects, improving the efficiency of speech recognition for multiple dialects. Also provided are a speech recognition device and system.

Description

Speech recognition method, device and system
This application claims priority to Chinese Patent Application No. 201711147698.X, filed on November 17, 2017 and entitled "Speech recognition method, device and system", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, and system.
Background
Automatic Speech Recognition (ASR) is a technology that converts human speech audio signals into text. With the development of software and hardware technologies, the computing power and storage capacity of various smart devices have greatly improved, allowing speech recognition technology to be widely applied in smart devices.
In speech recognition, speech phonemes need to be recognized accurately, and only accurately recognized phonemes can be converted into text. However, whatever the language, various factors give rise to many different pronunciations of that language, that is, multiple dialects. Taking Chinese as an example, there are Mandarin, Jin, Xiang, Gan, Wu, Min, Cantonese, Hakka, and other dialects, and the pronunciations of different dialects differ considerably.
At present, speech recognition schemes for dialects are still immature, and a solution to the multi-dialect problem remains to be provided.
Summary of the invention
Aspects of the present application provide a speech recognition method, apparatus, and system for automatically performing speech recognition on multiple dialects, thereby improving the efficiency of speech recognition for multiple dialects.
An embodiment of the present application provides a speech recognition method, applicable to a terminal device, the method including:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
sending a service request to a server to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
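The steps of this terminal-side method can be illustrated with the following sketch. The server and the wake-word dialect classifier are hypothetical in-memory stand-ins, since the document does not prescribe a transport protocol or a concrete classifier; a real terminal would talk to the server over its network stack.

```python
# Sketch of the terminal-side flow: wake-up word -> dialect ->
# service request -> speech signal. The "server" here is an in-memory
# stand-in for illustration only.

def identify_wake_word_dialect(wake_word_audio):
    # Hypothetical classifier keyed on a pseudo-fingerprint of the audio.
    known = {"wake-cantonese": "Cantonese", "wake-mandarin": "Mandarin"}
    return known.get(wake_word_audio, "Mandarin")

def send_service_request(server, dialect):
    # Step 3: request that the server select the ASR model for `dialect`.
    server["selected_model"] = server["asr_models"][dialect]

def send_speech_for_recognition(server, speech_signal):
    # Step 4: the server recognizes speech with the selected model.
    return server["selected_model"](speech_signal)

server = {  # toy server holding one stand-in ASR model per dialect
    "asr_models": {
        "Cantonese": lambda s: f"[Cantonese ASR] {s}",
        "Mandarin": lambda s: f"[Mandarin ASR] {s}",
    },
    "selected_model": None,
}

dialect = identify_wake_word_dialect("wake-cantonese")      # steps 1-2
send_service_request(server, dialect)                       # step 3
print(send_speech_for_recognition(server, "speech-bytes"))  # step 4
```

The point of the ordering is that the model is selected once, at wake-up time, so every subsequent speech signal in the session is recognized with the dialect-appropriate model without further negotiation.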
An embodiment of the present application further provides a speech recognition method, applicable to a server, the method including:
receiving a service request sent by a terminal device, the service request indicating selection of an ASR model corresponding to a first dialect;
selecting, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
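On the server side, the core of this method is a lookup from the requested dialect to a pre-built ASR model. The following is a minimal sketch under that reading; the per-dialect models are stand-in objects, not real recognizers.

```python
# Sketch: the server holds one ASR model per dialect and selects the
# model named by the service request. AsrModel is a stand-in class.

class AsrModel:
    def __init__(self, dialect):
        self.dialect = dialect

    def recognize(self, speech_signal):
        # A real model would decode the audio; this stub just labels it.
        return f"transcript ({self.dialect})"

ASR_MODELS = {d: AsrModel(d) for d in ("Mandarin", "Cantonese", "Xiang")}

def handle_service_request(first_dialect):
    """Select the ASR model for the dialect indicated in the request."""
    model = ASR_MODELS.get(first_dialect)
    if model is None:
        raise ValueError(f"no ASR model for dialect: {first_dialect}")
    return model

model = handle_service_request("Cantonese")
print(model.recognize(b"speech-to-recognize"))  # → transcript (Cantonese)
```

Keeping all models resident on the server and selecting per session is the design choice the claims describe: the terminal never needs to ship or store a dialect model itself.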
An embodiment of the present application further provides a speech recognition method, applicable to a terminal device, the method including:
receiving a voice wake-up word;
sending the voice wake-up word to a server, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition method, applicable to a server, the method including:
receiving a voice wake-up word sent by a terminal device;
identifying a first dialect to which the voice wake-up word belongs;
selecting, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receiving a speech signal to be recognized sent by the terminal device, and performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition method, including:
receiving a voice wake-up word;
identifying a first dialect to which the voice wake-up word belongs;
selecting an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
performing speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect.
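This variant places the entire pipeline on a single device. A compact sketch follows, with both the wake-word-to-dialect table and the per-dialect models reduced to toy stand-ins (neither is specified concretely by the document):

```python
# Sketch of the single-device variant: identify the wake-up word's
# dialect, select the matching ASR model, recognize the speech signal.

WAKE_WORD_DIALECTS = {"nei hou": "Cantonese", "ni hao": "Mandarin"}  # toy table

ASR_MODELS = {  # stand-in models, one per supported dialect
    "Cantonese": lambda speech: f"Cantonese transcript of {speech}",
    "Mandarin": lambda speech: f"Mandarin transcript of {speech}",
}

def recognize(wake_word, speech_signal, default_dialect="Mandarin"):
    dialect = WAKE_WORD_DIALECTS.get(wake_word, default_dialect)  # steps 1-2
    model = ASR_MODELS[dialect]                                   # step 3
    return model(speech_signal)                                   # step 4

print(recognize("nei hou", "speech"))  # → Cantonese transcript of speech
```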
An embodiment of the present application further provides a speech recognition method, applicable to a terminal device, the method including:
receiving a voice wake-up word to wake up a speech recognition function;
receiving a first voice signal, input by a user, having dialect-indicating meaning;
parsing, from the first voice signal, a first dialect that requires speech recognition;
sending a service request to a server to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
sending a speech signal to be recognized to the server, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
send a service request to a server through the communication component to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application further provides a server, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive, through the communication component, a service request sent by a terminal device, the service request indicating selection of an ASR model corresponding to a first dialect;
select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and
receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the service request and the speech signal to be recognized.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component;
send the voice wake-up word to a server through the communication component, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word, and to send the voice wake-up word and the speech signal to be recognized to the server.
An embodiment of the present application further provides a server, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive, through the communication component, a voice wake-up word sent by a terminal device;
identify a first dialect to which the voice wake-up word belongs;
select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect; and
receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word and the speech signal to be recognized.
An embodiment of the present application further provides an electronic device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component;
identify a first dialect to which the voice wake-up word belongs;
select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word.
An embodiment of the present application further provides a terminal device, including a memory, a processor, and a communication component;
the memory is configured to store a computer program;
the processor, coupled to the memory, is configured to execute the computer program to:
receive a voice wake-up word through the communication component to wake up a speech recognition function;
receive, through the communication component, a first voice signal, input by a user, having dialect-indicating meaning;
parse, from the first voice signal, a first dialect that requires speech recognition;
send a service request to a server through the communication component to request the server to select an ASR model corresponding to the first dialect from ASR models corresponding to different dialects; and
send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
the communication component is configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
An embodiment of the present application further provides a computer readable storage medium storing a computer program which, when executed by a computer, implements the steps in the first speech recognition method embodiment described above.
An embodiment of the present application further provides a computer readable storage medium storing a computer program which, when executed by a computer, implements the steps in the second speech recognition method embodiment described above.
An embodiment of the present application further provides a speech recognition system, including a server and a terminal device;
the terminal device is configured to receive a voice wake-up word, identify a first dialect to which the voice wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, the service request indicating selection of the ASR model corresponding to the first dialect;
the server is configured to receive the service request, select, as indicated by the service request, the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
An embodiment of the present application further provides a speech recognition system, including a server and a terminal device;
the terminal device is configured to receive a voice wake-up word, send the voice wake-up word to the server, and send a speech signal to be recognized to the server;
the server is configured to receive the voice wake-up word, identify a first dialect to which the voice wake-up word belongs, select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
In the embodiments of the present application, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected ASR model is used to perform speech recognition on the subsequent speech signal to be recognized. This automates multi-dialect speech recognition; and because the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, no manual operation by the user is required, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
附图说明DRAWINGS
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings described herein are intended to provide a further understanding of the present application, and are intended to be a part of this application. In the drawing:
FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present application;
FIG. 2 is a schematic flowchart of a speech recognition method according to another exemplary embodiment of the present application;
FIG. 3 is a schematic flowchart of another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 4 is a schematic structural diagram of another speech recognition system according to yet another exemplary embodiment of the present application;
FIG. 5 is a schematic flowchart of yet another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 6 is a schematic flowchart of yet another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 7 is a schematic flowchart of yet another speech recognition method according to yet another exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of the module structure of a speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device according to yet another exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of the module structure of another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server according to yet another exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of the module structure of yet another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of yet another terminal device according to yet another exemplary embodiment of the present application;
FIG. 14 is a schematic diagram of the module structure of yet another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 15 is a schematic structural diagram of another server according to yet another exemplary embodiment of the present application;
FIG. 16 is a schematic diagram of the module structure of yet another speech recognition apparatus according to yet another exemplary embodiment of the present application;
FIG. 17 is a schematic structural diagram of an electronic device according to yet another exemplary embodiment of the present application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments and the corresponding drawings. It is apparent that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
In the prior art, speech recognition schemes for dialects are not yet mature. To address this technical problem, the embodiments of the present application provide a solution whose main idea is as follows: ASR models are built for different dialects; during speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition; the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic structural diagram of a speech recognition system according to an exemplary embodiment of the present application. As shown in FIG. 1, the speech recognition system 100 includes a server 101 and a terminal device 102, which are communicatively connected to each other.
For example, the terminal device 102 may communicate with the server 101 via the Internet, or via a mobile network. If the terminal device 102 communicates with the server 101 via a mobile network, the network standard of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMAX, and the like.
The server 101 mainly provides ASR models for different dialects and selects the corresponding ASR model to perform speech recognition on speech signals in the corresponding dialect. The server 101 may be any device that can provide computing services, respond to service requests, and process them, such as a conventional server, a cloud server, a cloud host, or a virtual center. The server mainly comprises a processor, a hard disk, memory, a system bus, and the like, similar to a general-purpose computer architecture.
In this embodiment, the terminal device 102 is mainly user-facing and may provide the user with an interface or entry point for speech recognition. The terminal device 102 may take many forms, such as a smartphone, a smart speaker, a personal computer, a wearable device, or a tablet computer. The terminal device 102 typically includes at least one processing unit and at least one memory; their number depends on the configuration and type of the terminal device 102. The memory may be volatile, such as RAM, non-volatile, such as read-only memory (ROM) or flash memory, or include both types. The memory typically stores an operating system (OS), one or more applications, and possibly program data. In addition to the processing unit and memory, the terminal device 102 also includes some basic components, such as a network interface chip, an IO bus, and audio/video components (e.g., a microphone). Optionally, the terminal device 102 may also include peripheral devices, such as a keyboard, a mouse, a stylus, or a printer. These peripheral devices are well known in the art and are not described herein.
In this embodiment, the terminal device 102 and the server 101 cooperate to provide a speech recognition function to the user. In addition, in some cases the terminal device 102 may be used by multiple users who speak different dialects. Taking Chinese as an example, by region the dialects may include the following groups: Mandarin, Jin, Xiang, Gan, Wu, Min, Cantonese (Yue), and Hakka. Some dialects can be further subdivided; for example, Min includes Northern Min, Southern Min, Eastern Min, Central Min, Puxian, and so on. The pronunciations of different dialects differ considerably, so a single ASR model cannot perform speech recognition for all of them. Therefore, in this embodiment, separate ASR models are built for different dialects to enable speech recognition of each. Based on the cooperation between the terminal device 102 and the server 101, a speech recognition function can then be provided to users of different dialects; that is, the speech signals of users speaking different dialects can be recognized.
To improve speech recognition efficiency, the terminal device 102 supports a voice wake-up word function: when the user wants to perform speech recognition, the user can input a voice wake-up word to the terminal device 102 to wake up the speech recognition function. The voice wake-up word is a speech signal with specified text content, for example "start", "Tmall Genie", or "hello". The terminal device 102 receives the voice wake-up word input by the user and identifies the dialect to which it belongs, thereby determining the dialect of the subsequent to-be-recognized voice signal (i.e., the dialect of the voice wake-up word), which provides the basis for performing speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is referred to as the first dialect. The first dialect may be any dialect of any language.
After identifying the first dialect to which the voice wake-up word belongs, the terminal device 102 may send a service request to the server 101, instructing the server 101 to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. The server 101 receives the service request sent by the terminal device 102 and then, according to the request, selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so that subsequent to-be-recognized voice signals can be recognized with that model. In this embodiment, the server 101 stores the ASR models corresponding to different dialects in advance. An ASR model is a model that converts speech signals into text. Optionally, each dialect may have its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert speech signals in the first dialect into text content.
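The dialect-to-model selection described above can be sketched as a simple server-side registry. The dialect names, `ASRModel` class, and registry layout below are illustrative assumptions for this sketch, not part of the present application; the sketch only shows that one model may serve several similar dialects, as the embodiment allows.

```python
class ASRModel:
    """Placeholder ASR model that 'transcribes' speech in one dialect."""
    def __init__(self, dialect):
        self.dialect = dialect

    def transcribe(self, audio):
        # A real model would decode audio; here we return a stub transcript.
        return f"<{self.dialect} transcript of {len(audio)} samples>"

# One model per dialect; several similar dialects may share a model.
MODEL_REGISTRY = {
    "mandarin": ASRModel("mandarin"),
    "cantonese": ASRModel("cantonese"),
}
MODEL_REGISTRY["southern_min"] = MODEL_REGISTRY["northern_min"] = ASRModel("min")

def select_model(first_dialect):
    """Select the ASR model for the dialect identified from the wake-up word."""
    model = MODEL_REGISTRY.get(first_dialect)
    if model is None:
        raise KeyError(f"no ASR model registered for dialect {first_dialect!r}")
    return model
```

Once selected, the same model instance is reused for the subsequent to-be-recognized voice signal, so the lookup happens only once per wake-up.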
After sending the service request to the server 101, the terminal device 102 continues by sending the to-be-recognized voice signal, which belongs to the first dialect, to the server 101. The server 101 receives the to-be-recognized voice signal sent by the terminal device 102 and performs speech recognition on it with the selected ASR model corresponding to the first dialect. This not only makes speech recognition of the first dialect possible; using a matching ASR model also helps improve recognition accuracy.
Optionally, the to-be-recognized voice signal may be a speech signal that the user continues to input to the terminal device 102 after inputting the voice wake-up word; in this case, the terminal device 102 also receives the to-be-recognized voice signal input by the user before sending it to the server 101. Alternatively, the to-be-recognized voice signal may be a speech signal recorded in advance and stored locally on the terminal device 102, in which case the terminal device 102 can obtain it directly from local storage.
In some exemplary embodiments, the server 101 may return the speech recognition result, or information associated with it, to the terminal device 102. For example, the server 101 may return the recognized text content to the terminal device 102, or it may return information such as songs or videos that match the recognition result. The terminal device 102 receives the speech recognition result or its associated information returned by the server 101 and performs subsequent processing accordingly. For example, after receiving the recognized text content, the terminal device 102 may display it to the user or perform a web search based on it. As another example, after receiving associated information such as a song or video, the terminal device 102 may play it, or forward it to other users for sharing.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, and the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects. The selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
The embodiments of the present application do not limit the way in which the terminal device 102 identifies the first dialect to which the voice wake-up word belongs; any method that can identify that dialect is applicable to the embodiments of the present application. Some exemplary embodiments below list several ways in which the terminal device 102 may identify the dialect of the voice wake-up word:
Method 1: the terminal device 102 dynamically matches the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and takes as the first dialect the dialect of the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
In Method 1, reference wake-up words are recorded in advance in different dialects, with the same text content as the voice wake-up word. Because speakers of different dialects articulate differently, the acoustic features of the reference wake-up words recorded in different dialects differ. On this basis, the terminal device 102 pre-records the reference wake-up words in different dialects and, after receiving the voice wake-up word input by the user, dynamically matches its acoustic features against each reference wake-up word to obtain a matching degree for each. The first setting requirement may vary with the application scenario. For example, the dialect of the reference wake-up word with the highest matching degree may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect of a reference wake-up word whose matching degree exceeds the threshold taken as the first dialect; or a matching-degree range may be set, and the dialect of a reference wake-up word whose matching degree falls within that range taken as the first dialect.
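The three variants of the "first setting requirement" can be sketched as follows. The scores, threshold values, and dialect names are made-up assumptions for illustration; the embodiment does not prescribe any particular values.

```python
def pick_dialect(scores, mode="max", threshold=0.8, score_range=(0.8, 1.0)):
    """scores: dict mapping dialect -> matching degree with the wake-up word.

    mode "max":       dialect with the highest matching degree
    mode "threshold": a dialect whose matching degree exceeds a threshold
    mode "range":     a dialect whose matching degree falls within a range
    """
    if mode == "max":
        return max(scores, key=scores.get)
    if mode == "threshold":
        for dialect, s in scores.items():
            if s > threshold:
                return dialect
        return None  # no dialect met the requirement
    if mode == "range":
        lo, hi = score_range
        for dialect, s in scores.items():
            if lo <= s <= hi:
                return dialect
        return None
    raise ValueError(f"unknown mode: {mode}")
```

Whichever variant is used, the returned dialect is the "first dialect" that drives the subsequent ASR model selection.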
In Method 1, the acoustic features may take the form of time-domain and frequency-domain features of the speech signal. There are many matching methods based on such features; optionally, dynamic matching of the time series of the voice wake-up word can be performed based on the dynamic time warping (DTW) method.
Dynamic time warping is a method of measuring the similarity between two time series. The terminal device 102 generates a time series from the input voice wake-up word and compares it with the time series of each reference wake-up word recorded in a different dialect. Between the two time series being compared, at least one pair of similar points is determined, and the sum of the distances between similar points, i.e., the warping path distance, measures the similarity of the two series. Optionally, the dialect of the reference wake-up word with the smallest warping path distance to the voice wake-up word may be taken as the first dialect; or a distance threshold may be set, and the dialect of a reference wake-up word whose warping path distance to the voice wake-up word is below the threshold taken as the first dialect; or a distance range may be set, and the dialect of a reference wake-up word whose warping path distance falls within that range taken as the first dialect.
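The warping path distance described above can be computed with the textbook DTW recurrence. The sketch below operates on one-dimensional toy sequences for clarity; a real system would run the same recurrence over multidimensional acoustic feature frames with a vector distance.

```python
def dtw_distance(a, b):
    """Warping path distance between two 1-D sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

def closest_dialect(wake_word, references):
    """Pick the dialect whose reference wake-up word is closest under DTW.
    references: dict mapping dialect -> reference wake-up word sequence."""
    return min(references, key=lambda d: dtw_distance(wake_word, references[d]))
```

Because DTW allows one series to stretch against the other, it tolerates the speaking-rate differences that naturally occur between the user's wake-up word and the pre-recorded references.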
Method 2: the terminal device 102 identifies the acoustic features of the voice wake-up word, matches them against the acoustic features of different dialects, and takes as the first dialect the dialect whose matching degree with the acoustic features of the voice wake-up word meets a second setting requirement.
In Method 2, the acoustic features of different dialects are obtained in advance; the acoustic features of the voice wake-up word are identified, and the first dialect to which the voice wake-up word belongs is then determined based on the matching between acoustic features.
Optionally, before the acoustic features of the voice wake-up word are identified, the voice wake-up word may be filtered and digitized. Filtering refers to retaining the components of the voice wake-up word whose frequency lies between 300 and 3400 Hz. Digitization refers to performing A/D conversion and anti-aliasing processing on the retained signal.
Optionally, the acoustic features of the voice wake-up word may be identified by computing its spectral feature parameters, such as shifted delta cepstral (SDC) parameters. As in Method 1, the second setting requirement may vary with the application scenario. For example, the dialect whose acoustic features have the highest matching degree with those of the voice wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and a dialect whose matching degree with the acoustic features of the voice wake-up word exceeds the threshold taken as the first dialect; or a matching-degree range may be set, and a dialect whose matching degree with the acoustic features of the voice wake-up word falls within that range taken as the first dialect.
The shifted delta cepstral parameters consist of several blocks of delta cepstra spanning multiple speech frames; by taking the delta cepstra of preceding and following frames into account, they incorporate more temporal information. The SDC parameters of the voice wake-up word are compared with those of the reference wake-up words recorded in different dialects. Optionally, the dialect of the reference wake-up word whose SDC parameters best match those of the voice wake-up word may be taken as the first dialect; or a parameter-difference threshold may be set, and the dialect of a reference wake-up word whose SDC parameter difference from the voice wake-up word is below the threshold taken as the first dialect; or a parameter-difference range may be set, and the dialect of a reference wake-up word whose SDC parameter difference falls within that range taken as the first dialect.
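The block structure of SDC features can be sketched as follows, assuming the common (N, d, P, k) parameterization of shifted delta cepstra; the toy scalar frames below stand in for N-dimensional cepstral vectors, and the parameter values are illustrative assumptions only.

```python
def sdc(frames, d=1, P=3, k=7):
    """For each frame t, stack k delta values taken every P frames:
    delta_i(t) = frames[t + i*P + d] - frames[t + i*P - d], i = 0..k-1.
    This is how preceding/following frame deltas are folded into one feature."""
    out = []
    # last frame index t for which all k shifted deltas stay in bounds
    last = len(frames) - ((k - 1) * P + d) - 1
    for t in range(d, last + 1):
        out.append([frames[t + i * P + d] - frames[t + i * P - d]
                    for i in range(k)])
    return out
```

Each output row concatenates k delta blocks shifted P frames apart, which is what gives SDC features their longer temporal span compared with plain frame-level deltas.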
Method 3: the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect of the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement is taken as the first dialect.
In Method 3, the text wake-up word is the text obtained by performing speech recognition on the voice wake-up word, and the reference text wake-up words corresponding to different dialects are the texts obtained by performing speech recognition on the reference wake-up words of those dialects. Optionally, the same speech recognition model may be used to perform coarse recognition of both the voice wake-up word and the reference wake-up words of different dialects, to improve the efficiency of the overall speech recognition process. Alternatively, the ASR models of the different dialects may be used in advance to recognize the reference wake-up words of their respective dialects and convert them into the corresponding reference text wake-up words. After the voice wake-up word is received, the ASR model of one dialect at a time is selected; the voice wake-up word is recognized with the selected ASR model to obtain a text wake-up word, which is then matched against the reference text wake-up word of that dialect. If the matching degree between the reference text wake-up word of that dialect and the text wake-up word meets the third setting requirement, that dialect is taken as the first dialect.
Otherwise, if the matching degree between the reference text wake-up word of that dialect and the text wake-up word does not meet the third setting requirement, the voice wake-up word is recognized with the ASR model of the next dialect and converted into a text wake-up word, which is matched against that dialect's reference text wake-up word, and so on, until a reference text wake-up word whose matching degree with the text wake-up word meets the third setting requirement is obtained; the dialect of that reference text wake-up word is taken as the first dialect to which the voice wake-up word belongs.
Optionally, as in Methods 1 and 2, the dialect of the reference text wake-up word with the highest matching degree with the text wake-up word may be taken as the first dialect; or a matching-degree threshold may be set, and the dialect of a reference text wake-up word whose matching degree exceeds the threshold taken as the first dialect; or a matching-degree range may be set, and the dialect of a reference text wake-up word whose matching degree falls within that range taken as the first dialect.
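Method 3's sequential trial loop can be sketched as follows. The `recognize` callable and the character-level `similarity` measure are placeholder assumptions standing in for a per-dialect ASR pass and a text matching-degree metric; the 0.8 threshold is likewise illustrative.

```python
def similarity(a, b):
    """Toy matching degree: fraction of aligned positions where strings agree."""
    if not a or not b:
        return 0.0
    hits = sum(x == y for x, y in zip(a, b))
    return hits / max(len(a), len(b))

def identify_dialect(voice_wake_word, dialects, recognize,
                     reference_texts, required=0.8):
    """Try each dialect's ASR model in turn until the recognized text
    matches that dialect's reference text wake-up word well enough."""
    for dialect in dialects:
        text = recognize(dialect, voice_wake_word)
        if similarity(text, reference_texts[dialect]) >= required:
            return dialect
    return None  # no dialect's model produced a matching transcript
```

The intuition behind the loop is that only the ASR model of the correct dialect is likely to transcribe the wake-up word into text close to that dialect's reference text wake-up word.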
It should be noted that the first setting requirement, the second setting requirement, and the third setting requirement may be the same or different.
In some exemplary embodiments, if the terminal device 102 is a device with a display screen, such as a mobile phone, computer, or wearable device, a voice input interface may be displayed on the screen, through which text information and/or speech signals input by the user are obtained. Optionally, when the user needs speech recognition, the user may send an activation instruction to the terminal device 102, for example by pressing its power button or touching its display screen. In response to the instruction to activate or turn itself on, the terminal device 102 may present the voice input interface to the user on the display screen. Optionally, the voice input interface may show a microphone icon or text such as "enter wake-up word" to prompt the user to input the voice wake-up word. The terminal device 102 can then obtain the voice wake-up word input by the user through the voice input interface.
In some exemplary embodiments, the terminal device 102 may be a device with an audio playback function, such as a mobile phone, computer, or smart speaker. In that case, after sending the service request to the server 101 and before sending the to-be-recognized voice signal, the terminal device 102 may output a voice input prompt, such as "please speak" or "please make a request", to prompt the user for voice input. After inputting the voice wake-up word and upon hearing the prompt tone, the user can input the to-be-recognized voice signal to the terminal device 102. The terminal device 102 receives the to-be-recognized voice signal input by the user and sends it to the server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
In other exemplary embodiments, the terminal device 102 may be a device with a display screen, such as a mobile phone, computer, or wearable device. In that case, after sending the service request to the server 101 and before sending the to-be-recognized voice signal, the terminal device 102 may display voice input prompt information as text or icons, for example text such as "please speak" or a microphone icon, to prompt the user for voice input. After inputting the voice wake-up word, the user can input the to-be-recognized voice signal to the terminal device 102 in response to the prompt. The terminal device 102 receives the to-be-recognized voice signal input by the user and sends it to the server 101, which performs speech recognition on it with the ASR model corresponding to the first dialect.
在又一些示例性实施例中,终端设备102可以具有指示灯。基于此,终端设备102在向服务器101发送服务请求之后,并且在向服务器101发送待识别语音信号之前,可以点亮指示灯,以提示用户进行语音输入。对用户来说,在输入语音唤醒词之后,可以在该指示灯的提示下,向终端设备102输入待识别语音信号。终端设备102接收用户输入的待识别语音信号,将待识别语音信号发送给服务器101,由服务器101根据第一方言对应的ASR模型对待识别语音信号进行语音识别。In still other exemplary embodiments, the terminal device 102 may have an indicator light. Based on this, after the terminal device 102 transmits the service request to the server 101, and before transmitting the voice signal to be recognized to the server 101, the indicator light can be illuminated to prompt the user to perform voice input. For the user, after inputting the voice wake-up word, the voice signal to be recognized may be input to the terminal device 102 at the prompt of the indicator light. The terminal device 102 receives the to-be-identified voice signal input by the user, and sends the to-be-identified voice signal to the server 101. The server 101 performs voice recognition on the voice signal to be recognized according to the ASR model corresponding to the first dialect.
值得说明的是,终端设备102可以同时具备语音播放功能、指示灯、显示屏中的至少两种或者三种。基于此,终端设备102可同时以音频方式、以文本或者图标方式以及点亮指示灯的方式中的两种或三种,输出语音输入提示信息,从而加强与用户的互动效果。It should be noted that the terminal device 102 can simultaneously have at least two or three of a voice playing function, an indicator light, and a display screen. Based on this, the terminal device 102 can simultaneously output the voice input prompt information in two or three of an audio manner, a text or an icon manner, and a manner of lighting the indicator light, thereby enhancing the interaction effect with the user.
In some exemplary embodiments, before outputting the voice input prompt tone, outputting the voice input prompt information, or lighting the indicator, the terminal device 102 may first confirm that the server 101 has selected the ASR model corresponding to the first dialect, so that once the to-be-recognized voice signal input by the user is sent to the server 101, the server 101 can directly recognize it with the selected ASR model. To this end, after selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 101 returns a notification message to the terminal device 102 indicating that the ASR model corresponding to the first dialect has been selected. The terminal device 102 receives this notification message and thereby learns that the server 101 has selected the ASR model corresponding to the first dialect. After receiving the notification message returned by the server 101, the terminal device 102 may then output the voice input prompt tone, output the voice input prompt information, or light the indicator to prompt the user to perform voice input.
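As an illustrative sketch only (the application does not specify a wire format; the JSON message types and field names below are hypothetical), the request/notification exchange described above might look like:

```python
import json

def build_service_request(dialect: str) -> str:
    # Hypothetical service request asking the server to select the
    # ASR model for the identified first dialect.
    return json.dumps({"type": "service_request", "dialect": dialect})

def handle_on_server(message: str) -> str:
    # Server side: select the ASR model for the requested dialect and
    # return a notification indicating the selection has been made.
    req = json.loads(message)
    selected = req["dialect"]  # a real server would look up its model registry here
    return json.dumps({"type": "notification", "selected_dialect": selected})

def terminal_should_prompt(notification: str) -> bool:
    # Terminal side: only output the voice-input prompt (tone, text/icon,
    # or indicator light) once the notification has been received.
    note = json.loads(notification)
    return note.get("type") == "notification"

request = build_service_request("Cantonese")
notification = handle_on_server(request)
print(terminal_should_prompt(notification))  # True -> prompt the user to speak
```

The point of the exchange is ordering: the prompt is withheld until the notification arrives, guaranteeing the server is ready to recognize the signal that follows.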
In the embodiments of the present application, before selecting the ASR model corresponding to the first dialect, the server 101 needs to build ASR models corresponding to different dialects. The building process mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features. For the detailed procedure of building the ASR model for each dialect, reference may be made to the prior art, and details are not repeated here.
Optionally, the corpora of the different dialects may be collected over the network, or voice recordings may be made of a large number of users who speak the different dialects.
Optionally, before feature extraction, the collected corpora of the different dialects may be preprocessed. The preprocessing includes pre-emphasis, windowing, and endpoint detection of the speech. After preprocessing, features can be extracted from the speech. Speech features include time-domain features and frequency-domain features: the time-domain features include short-time average energy, short-time average zero-crossing rate, formants, and pitch period; the frequency-domain features include linear prediction coefficients, LPC cepstral coefficients, line spectrum pair parameters, the short-time spectrum, and Mel-frequency cepstral coefficients.
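The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration, not the application's implementation; the frame length, hop size, and pre-emphasis coefficient are assumed values, and the short-time energy shown is one of the time-domain features listed (a simple energy threshold on it can serve as a crude endpoint detector):

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               alpha: float = 0.97) -> np.ndarray:
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing plus a Hamming window (the windowing step).
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    # Short-time average energy per frame; thresholding it gives a
    # rudimentary endpoint (speech/non-speech) decision.
    return (frames ** 2).mean(axis=1)

rng = np.random.default_rng(0)
frames = preprocess(rng.standard_normal(16000))  # 1 s of synthetic 16 kHz audio
energy = short_time_energy(frames)
print(frames.shape, energy.shape)  # (98, 400) (98,)
```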
Taking the Mel-frequency cepstral coefficients as an example, the acoustic feature extraction process is as follows. First, exploiting the perceptual characteristics of the human ear, several band-pass filters, each with a triangular or sinusoidal filter characteristic, are placed across the spectral range of the speech. Energy information is then incorporated into the feature vector obtained by filtering the corpus through these band-pass filters: the signal energy within each band-pass filter is computed, and the Mel-frequency cepstral coefficients are obtained via a discrete cosine transform.
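A minimal sketch of that pipeline (triangular mel-spaced band-pass filters, log filter energies, then a DCT) is given below. The filter count, FFT size, and sampling rate are assumed values chosen for illustration; a production front end would differ in detail:

```python
import numpy as np

def mel_filterbank(n_filters: int = 26, n_fft: int = 512, sr: int = 16000):
    # Triangular band-pass filters spaced on the mel scale, reflecting
    # the perceptual characteristics of the human ear.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(frame: np.ndarray, n_ceps: int = 13, n_fft: int = 512, sr: int = 16000):
    # Log energy in each band-pass filter, then a DCT-II (built here as an
    # explicit matrix) yields the Mel-frequency cepstral coefficients.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    energies = np.log(mel_filterbank(n_fft=n_fft, sr=sr) @ power + 1e-10)
    n = len(energies)
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                 * np.arange(n_ceps)[:, None])
    return dct @ energies

rng = np.random.default_rng(1)
coeffs = mfcc(np.hamming(400) * rng.standard_normal(400))
print(coeffs.shape)  # (13,) -- 13 cepstral coefficients for this frame
```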
After the acoustic features of the different dialects are obtained, the parameters of the initial model for each dialect are trained with that dialect's acoustic features as input and the text corresponding to its corpus as output, yielding the ASR model corresponding to each dialect. Optionally, the ASR model includes, but is not limited to, a model built with vector quantization, a neural network model, and the like.
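As a toy illustration of the vector-quantization option mentioned above (not the application's actual training procedure), one can train a small k-means codebook per dialect on its acoustic feature vectors; here the codebooks are used only to pick the best-matching dialect by quantization error, standing in for the full feature-to-text mapping of a real ASR model. All data and dialect names below are synthetic:

```python
import numpy as np

def train_codebook(features: np.ndarray, k: int = 4, iters: int = 20,
                   seed: int = 0) -> np.ndarray:
    # Minimal vector-quantization training (k-means): the codebook
    # summarizes the distribution of a dialect's acoustic feature vectors.
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((features[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def classify_dialect(features: np.ndarray, codebooks: dict) -> str:
    # Score a feature sequence against each dialect's codebook by total
    # quantization error; the lowest-error dialect wins.
    def err(cb):
        return ((features[:, None] - cb) ** 2).sum(-1).min(axis=1).sum()
    return min(codebooks, key=lambda d: err(codebooks[d]))

rng = np.random.default_rng(2)
corpus = {"Cantonese": rng.normal(0.0, 1.0, (200, 13)),
          "Tibetan": rng.normal(3.0, 1.0, (200, 13))}
codebooks = {d: train_codebook(f) for d, f in corpus.items()}
sample = rng.normal(3.0, 1.0, (50, 13))  # unseen Tibetan-like features
print(classify_dialect(sample, codebooks))  # Tibetan
```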
The above embodiments are described in detail below, taking as an example an application scenario in which several users who speak different dialects use the same terminal device to order songs.
The terminal device with the song-ordering function may be a smart speaker. Optionally, the smart speaker has a display screen, and its preset voice wake-up word is "hello". When a Cantonese-speaking user wants to order a song, the user first touches the display screen to input an instruction that activates the smart speaker. In response, the smart speaker displays a voice input interface on the screen, on which the text "hello" is shown. The Cantonese user speaks "hello" into the voice input interface. Through the interface, the smart speaker captures the "hello" voice signal and recognizes that it belongs to the Cantonese dialect; it then sends a service request asking the server to select the ASR model corresponding to Cantonese from the ASR models corresponding to different dialects. After receiving the service request, the server selects the ASR model corresponding to Cantonese and returns a notification message to the smart speaker indicating that this model has been selected. The smart speaker then outputs voice input prompt information, such as "please input voice", to prompt the user to speak.
Guided by the prompt, the Cantonese user speaks the song title "Five-Star Red Flag". The smart speaker receives this voice signal and sends it to the server. Using the ASR model corresponding to Cantonese, the server recognizes the voice signal, obtains the text "Five-Star Red Flag", and delivers a matching song to the smart speaker for playback.
Similarly, suppose that after the Cantonese user has finished, a Tibetan-speaking user wants to order a song. The Tibetan user speaks "hello" into the voice input interface displayed by the smart speaker. The smart speaker recognizes that this "hello" belongs to the Tibetan dialect and sends a service request asking the server to select the ASR model corresponding to Tibetan from the ASR models corresponding to different dialects. After receiving the service request, the server selects the ASR model corresponding to Tibetan and returns a notification message to the smart speaker indicating that this model has been selected. The smart speaker then outputs voice input prompt information, such as "please input voice", to prompt the user to speak. Guided by the prompt, the Tibetan user speaks the song title "My Motherland". The smart speaker receives this voice signal and sends it to the server. Using the ASR model corresponding to Tibetan, the server recognizes the voice signal, obtains the text "My Motherland", and delivers a matching song to the smart speaker for playback.
In this application scenario, with the speech recognition method provided by the embodiments of the present application, users who speak different dialects can order songs on the same smart speaker without manually switching ASR models: each user simply speaks the voice wake-up word in his or her own dialect. The smart speaker automatically identifies the dialect of the wake-up word and asks the server to activate the ASR model corresponding to that dialect to recognize the requested song title. This supports automated multi-dialect song ordering while improving its efficiency.
FIG. 2 is a schematic flowchart of a speech recognition method provided by another exemplary embodiment of the present application. This embodiment may be implemented based on the speech recognition system shown in FIG. 1 and is described mainly from the perspective of the terminal device. As shown in FIG. 2, the method includes:
21. Receive a voice wake-up word.
22. Identify the first dialect to which the voice wake-up word belongs.
23. Send a service request to the server, requesting the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
24. Send the to-be-recognized voice signal to the server, so that the server performs speech recognition on it with the ASR model corresponding to the first dialect.
When the user wants to perform speech recognition, a voice wake-up word may be input to the terminal device. The voice wake-up word is a voice signal with specified text content, such as "turn on", "Tmall Genie", or "hello". The terminal device receives the voice wake-up word input by the user and identifies the dialect to which it belongs. This determines the dialect of the subsequent to-be-recognized voice signal (namely, the dialect of the wake-up word) and provides the basis for performing speech recognition with the ASR model corresponding to that dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
After identifying the first dialect, the terminal device sends a service request to the server instructing it to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. The terminal device then sends the to-be-recognized voice signal to the server. After receiving the service request, the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, and uses the selected model to recognize the received to-be-recognized voice signal.
In this embodiment, the terminal device identifies the first dialect to which the voice wake-up word belongs and sends a service request so that the server selects the ASR model corresponding to the first dialect for recognizing the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition. Further, because the voice wake-up word is short, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect, select the corresponding ASR model, and further improve recognition efficiency.
In some exemplary embodiments, the first dialect to which the voice wake-up word belongs may be identified in one of the following ways. In one way, the voice wake-up word is dynamically matched, in terms of acoustic features, against reference wake-up words recorded in different dialects, and the dialect of the reference wake-up word whose matching degree meets a first set requirement is taken as the first dialect. In another way, the acoustic features of the voice wake-up word are matched against the acoustic features of different dialects, and the dialect whose matching degree meets a second set requirement is taken as the first dialect. In yet another way, the voice wake-up word is converted into a text wake-up word, the text wake-up word is matched against the reference text wake-up words corresponding to different dialects, and the dialect of the reference text wake-up word whose matching degree meets a third set requirement is taken as the first dialect.
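The first way above, dynamic matching of acoustic features against reference wake-up words recorded in each dialect, is commonly realized with dynamic time warping (DTW), which tolerates differences in speaking rate. A minimal sketch, assuming MFCC-like feature sequences and hypothetical dialect names (the application does not prescribe DTW specifically):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Dynamic time warping between two acoustic feature sequences:
    # the classic cumulative-cost recursion over a (n+1) x (m+1) grid.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])

def identify_dialect(wake_word: np.ndarray, references: dict) -> str:
    # The dialect whose reference wake-up word matches best (lowest DTW
    # distance) is taken as the first dialect.
    return min(references, key=lambda d: dtw_distance(wake_word, references[d]))

rng = np.random.default_rng(3)
ref_a = rng.standard_normal((20, 13))
references = {"dialect_A": ref_a, "dialect_B": rng.standard_normal((25, 13))}
spoken = ref_a + 0.01 * rng.standard_normal((20, 13))  # noisy dialect_A utterance
print(identify_dialect(spoken, references))  # dialect_A
```

In practice a set requirement (the "first set requirement") would also impose a maximum acceptable distance, so that an out-of-vocabulary utterance matches no dialect at all.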
In some exemplary embodiments, receiving the voice wake-up word includes: in response to an instruction to activate or turn on the terminal device, presenting a voice input interface to the user; and acquiring, through the voice input interface, the voice wake-up word input by the user.
In some exemplary embodiments, before the to-be-recognized voice signal is sent to the server, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the to-be-recognized voice signal input by the user.
In some exemplary embodiments, before the voice input prompt information is output, the method further includes: receiving a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
FIG. 3 is a schematic flowchart of another speech recognition method provided by yet another exemplary embodiment of the present application. This embodiment may be implemented based on the speech recognition system shown in FIG. 1 and is described mainly from the perspective of the server. As shown in FIG. 3, the method includes:
31. Receive a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect.
32. From the ASR models corresponding to different dialects, select the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs.
33. Receive the to-be-recognized voice signal sent by the terminal device, and perform speech recognition on it with the ASR model corresponding to the first dialect.
In this embodiment, after identifying the first dialect to which the voice wake-up word belongs, the terminal device sends a service request to the server. According to the service request, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models corresponding to different dialects, and then performs speech recognition on the subsequent voice signal based on that model. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect, the server needs to build ASR models corresponding to different dialects. The building process mainly includes: collecting corpora of the different dialects; performing feature extraction on the corpora to obtain acoustic features of the different dialects; and building the ASR model corresponding to each dialect according to its acoustic features.
In some exemplary embodiments, after performing speech recognition on the to-be-recognized voice signal with the ASR model corresponding to the first dialect, the server may send the recognition result, or information associated with the recognition result, to the terminal device, so that the terminal device can perform subsequent processing based on the recognition result or its associated information.
FIG. 4 is a schematic structural diagram of another speech recognition system provided by yet another exemplary embodiment of the present application. As shown in FIG. 4, the speech recognition system 400 includes a server 401 and a terminal device 402, which are communicatively connected.
The architecture of the speech recognition system 400 provided by this embodiment is the same as that of the speech recognition system 100 shown in FIG. 1; the difference lies in the functions performed by the server 401 and the terminal device 402 during speech recognition. For the implementation forms of the terminal device 402 and the server 401 and the manner of their communication connection, reference may be made to the description of the embodiment shown in FIG. 1, and details are not repeated here.
Similar to the speech recognition system 100 shown in FIG. 1, in the speech recognition system 400 the terminal device 402 and the server 401 cooperate to provide the user with a speech recognition function. Moreover, considering that in some cases the terminal device 402 may be used by multiple users who speak different dialects, the speech recognition system 400 likewise builds a separate ASR model for each dialect. Through the cooperation of the terminal device 402 and the server 401, speech recognition can thus be provided for users speaking different dialects, that is, the voice signals of users speaking different dialects can all be recognized.
In the speech recognition system 400 shown in FIG. 4, the terminal device 402 also supports the voice wake-up word function, but unlike the terminal device 102 in the embodiment shown in FIG. 1, it mainly receives the voice wake-up word input by the user and reports it to the server 401, which identifies the dialect to which the wake-up word belongs. Correspondingly, in the speech recognition system 400, the server 401 not only provides ASR models for different dialects and selects the appropriate model to recognize voice signals in the corresponding dialect, but also performs the identification of the dialect to which the voice wake-up word belongs.
Based on the speech recognition system 400 shown in FIG. 4, when the user wants to perform speech recognition, a voice wake-up word may be input to the terminal device 402. The voice wake-up word is a voice signal with specified text content, such as "turn on", "Tmall Genie", or "hello". The terminal device 402 receives the voice wake-up word input by the user and sends it to the server 401. After receiving the voice wake-up word, the server 401 identifies the dialect to which it belongs. For ease of description and distinction, this dialect is denoted the first dialect; it may be, for example, a Mandarin dialect, Jin, or Xiang. The server 401 then selects, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, so that voice signals in the first dialect can subsequently be recognized with that model. In this embodiment, the server 401 pre-stores the ASR models corresponding to different dialects. Optionally, each dialect may correspond to its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert voice signals in the first dialect into text content.
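A minimal sketch of the server-side selection step, assuming a hypothetical in-memory registry keyed by dialect name (the application does not specify how the server 401 stores or indexes its models, and the `AsrModel` class below is a placeholder):

```python
class AsrModel:
    """Placeholder for a dialect-specific ASR model."""

    def __init__(self, dialect: str):
        self.dialect = dialect

    def recognize(self, speech_signal: bytes) -> str:
        # A real model converts the dialect speech signal to text content.
        return f"<text recognized with {self.dialect} model>"

# Pre-stored models, one per dialect; several similar dialects could
# instead map to the same AsrModel instance.
MODEL_REGISTRY = {
    "Cantonese": AsrModel("Cantonese"),
    "Tibetan": AsrModel("Tibetan"),
    "Mandarin": AsrModel("Mandarin"),
}

def select_model(first_dialect: str) -> AsrModel:
    # Selection step performed by the server 401 after it identifies the
    # first dialect from the voice wake-up word.
    return MODEL_REGISTRY[first_dialect]

model = select_model("Cantonese")
print(model.recognize(b"..."))
```

Mapping several similar dialects to one shared entry is just a matter of pointing multiple registry keys at the same model instance.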
After sending the voice wake-up word to the server 401, the terminal device 402 goes on to send the to-be-recognized voice signal to the server 401. The server 401 receives the to-be-recognized voice signal and performs speech recognition on it with the ASR model corresponding to the first dialect. Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device 402 after inputting the voice wake-up word; in that case, before sending it to the server 401, the terminal device 402 also receives the to-be-recognized voice signal input by the user. Alternatively, the to-be-recognized voice signal may be a voice signal pre-recorded and stored locally on the terminal device 402.
In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to different dialects, and the selected model is used to recognize the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the appropriate dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
Further, because the voice wake-up word is short, identifying its dialect takes little time, so the speech recognition system can quickly identify the first dialect and select the corresponding ASR model, further improving the efficiency of multi-dialect speech recognition.
In some exemplary embodiments, the server 401 identifies the first dialect to which the voice wake-up word belongs by dynamically matching the voice wake-up word, in terms of acoustic features, against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect of the reference wake-up word whose matching degree meets a first set requirement.
In other exemplary embodiments, the server 401 matches the acoustic features of the voice wake-up word against the acoustic features of different dialects, and takes as the first dialect the dialect whose matching degree meets a second set requirement.
In still other exemplary embodiments, the server 401 converts the voice wake-up word into a text wake-up word, matches the text wake-up word against the reference text wake-up words corresponding to different dialects, and takes as the first dialect the dialect of the reference text wake-up word whose matching degree meets a third set requirement.
The manner in which the server 401 identifies the first dialect to which the voice wake-up word belongs is similar to the manner in which the terminal device 102 does so; for a detailed description, reference may be made to the foregoing embodiments, and details are not repeated here.
在一些示例性实施例中,终端设备402接收语音唤醒词的方式包括:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In some exemplary embodiments, the manner in which the terminal device 402 receives the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; acquiring a voice wake-up word input by the user based on the voice input interface.
在一些示例性实施例中,终端设备402在向服务器401发送待识别语音信号之前,可以输出语音输入提示信息,以提示用户进行语音输入;之后,接收用户输入的待识别语音信号。In some exemplary embodiments, before transmitting the to-be-identified voice signal to the server 401, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input; and thereafter, receive the voice signal to be recognized input by the user.
在一些示例性实施例中,终端设备402在输出语音输入提示信息之前,可以接收服务器401返回的通知消息,该通知消息用于指示已选择第一方言对应的ASR模型。基于此,终端设备402可以在确定服务器401已选择第一方言对应的ASR模型之后,向用户输出语音输入提示信息,以提示用户进行语音输入,这样在将用户输入的待识别语音信号发送至服务器401后,服务器401可以直接根据已选择的ASR模型对待识别语音信号进行识别。In some exemplary embodiments, before outputting the voice input prompt information, the terminal device 402 may receive a notification message returned by the server 401 indicating that the ASR model corresponding to the first dialect has been selected. Based on this, after determining that the server 401 has selected the ASR model corresponding to the first dialect, the terminal device 402 may output voice input prompt information to prompt the user to perform voice input, so that once the speech signal to be recognized input by the user is sent to the server 401, the server 401 can directly recognize it with the selected ASR model.
在一些示例性实施例中,服务器401在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,可以收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。关于构建每种方言对应的ASR模型的详细过程可参见现有技术,在此不再赘述。In some exemplary embodiments, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the server 401 may collect corpora of different dialects, perform feature extraction on the corpora to obtain the acoustic features of the different dialects, and construct the ASR model corresponding to each dialect according to its acoustic features. For the detailed process of constructing the ASR model corresponding to each dialect, refer to the prior art; details are not described herein again.
在一些示例性实施例中,服务器401可以向终端设备402返回语音识别结果或语音识别结果的关联信息。例如,服务器401可以将语音识别出的文本内容返回给终端设备402;或者,服务器401也可以将与语音识别结果相匹配的歌曲、视频等信息返回给终端设备402。终端设备402接收服务器401返回的语音识别结果或语音识别结果的关联信息,并根据语音识别结果或语音识别结果的关联信息执行后续处理。In some exemplary embodiments, the server 401 may return the speech recognition result, or information associated with it, to the terminal device 402. For example, the server 401 may return the recognized text content to the terminal device 402, or return information such as songs or videos that match the speech recognition result. The terminal device 402 receives the speech recognition result or its associated information returned by the server 401 and performs subsequent processing accordingly.
图5为本申请又一示例性实施例提供的又一种语音识别方法的流程示意图。该实施例可基于图4所示语音识别系统实现,主要是从终端设备的角度进行的描述。如图5所示,该方法包括:FIG. 5 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in FIG. 4, mainly from the perspective of the terminal device. As shown in FIG. 5, the method includes:
51、接收语音唤醒词。51. Receive a speech wake up word.
52、向服务器发送语音唤醒词,以供服务器基于语音唤醒词从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型。52. Send a voice wake-up word to the server, so that the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR model corresponding to different dialects based on the voice wake-up word.
53、向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。53. Send a voice signal to be identified to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
当用户想要进行语音识别时,可以向终端设备输入语音唤醒词,该语音唤醒词是指定文本内容的语音信号,例如“开启”、“天猫精灵”、“hello”等。终端设备接收用户发送的语音唤醒词,并向服务器发送语音唤醒词,以供服务器识别该语音唤醒词所属的方言,进而可确定后续待识别语音信号所属的方言(即该语音唤醒词所属的方言),为采用相应方言对应的ASR模型进行语音识别提供基础。为便于描述和区分,将语音唤醒词所属的方言记为第一方言。When the user wants to perform speech recognition, a voice wake-up word may be input to the terminal device; the voice wake-up word is a voice signal with specified text content, such as "on", "Tmall Elf", or "hello". The terminal device receives the voice wake-up word sent by the user and forwards it to the server, so that the server can identify the dialect to which the voice wake-up word belongs and thereby determine the dialect of the subsequent speech signal to be recognized (i.e., the dialect to which the voice wake-up word belongs), providing a basis for speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
然后,服务器根据语音唤醒词所属的第一方言,从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型。接着,终端设备继续向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。Then, the server selects, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect to which the voice wake-up word belongs. Next, the terminal device continues to send the speech signal to be recognized to the server, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
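The terminal-side exchange in steps 51-53 can be summarized in a short sketch. The `send_to_server` transport callable and the message dictionaries are assumptions made purely for illustration, since this application does not prescribe a wire format.

```python
# Hypothetical terminal-side flow for steps 51-53.
# `send_to_server` is an injected transport callable.

def terminal_flow(wake_word_audio, speech_audio, send_to_server):
    # Steps 51/52: receive the wake word and forward it so the server can
    # identify its dialect and select the matching ASR model.
    send_to_server({"type": "wake_word", "audio": wake_word_audio})
    # Step 53: forward the speech signal to be recognized; the server will
    # recognize it with the ASR model selected for the first dialect.
    send_to_server({"type": "speech", "audio": speech_audio})
```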
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, the ASR model corresponding to that dialect is selected from the ASR models of the different dialects, and the selected model is used to recognize the subsequent speech signal to be recognized. This automates multi-dialect speech recognition, and the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word without any manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
在一些示例性实施例中,上述接收语音唤醒词包括:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In some exemplary embodiments, the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
在一些示例性实施例中,在向服务器发送待识别语音信号之前,该方法还包括:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In some exemplary embodiments, before transmitting the to-be-identified voice signal to the server, the method further includes: outputting the voice input prompt information to prompt the user to perform voice input; and receiving the voice signal to be recognized input by the user.
在一些示例性实施例中,在输出语音输入提示信息之前,该方法还包括:接收服务器返回的通知消息,通知消息用于指示已选择第一方言对应的ASR模型。In some exemplary embodiments, before outputting the voice input prompt information, the method further includes: receiving a notification message returned by the server, the notification message being used to indicate that the ASR model corresponding to the first dialect has been selected.
图6为本申请又一示例性实施例提供的又一种语音识别方法的流程示意图。该实施例可基于图4所示语音识别系统实现,主要是从服务器的角度进行的描述。如图6所示,该方法包括:FIG. 6 is a schematic flowchart diagram of still another voice recognition method according to still another exemplary embodiment of the present application. This embodiment can be implemented based on the speech recognition system shown in Fig. 4, mainly from the perspective of the server. As shown in FIG. 6, the method includes:
61、接收终端设备发送的语音唤醒词。61. Receive a voice wake-up word sent by the terminal device.
62、识别语音唤醒词所属的第一方言。62. Identify a first dialect to which the voice wake-up word belongs.
63、从不同方言对应的ASR模型中,选择第一方言对应的ASR模型。63. Select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
64、接收终端设备发送的待识别语音信号,并利用第一方言对应的ASR模型对待识别语音信号进行语音识别。64. Receive a voice signal to be recognized sent by the terminal device, and perform voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
服务器接收终端设备发送的语音唤醒词,识别该语音唤醒词所属的方言,进而可确定后续待识别语音信号所属的方言(即该语音唤醒词所属的方言),为采用相应方言对应的ASR模型进行语音识别提供基础。为便于描述和区分,将语音唤醒词所属的方言记为第一方言。The server receives the voice wake-up word sent by the terminal device and identifies the dialect to which it belongs, thereby determining the dialect of the subsequent speech signal to be recognized (i.e., the dialect to which the voice wake-up word belongs), which provides a basis for speech recognition with the ASR model of the corresponding dialect. For ease of description and distinction, the dialect to which the voice wake-up word belongs is denoted the first dialect.
然后,服务器从预先存储的不同方言对应的ASR模型中,选择第一方言对应的ASR模型,进而可基于第一方言对应的ASR模型为后续语音信号进行语音识别,实现了多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。Then, the server selects the ASR model corresponding to the first dialect from the pre-stored ASR models of different dialects, and can then perform speech recognition on subsequent speech signals based on that model. This automates multi-dialect speech recognition, and the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word without any manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
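The server-side flow of steps 61-64 can be sketched as a small stateful handler. The injected `identify_dialect` function and the per-dialect recognizer callables are placeholders for the components described in the text, not a prescribed implementation.

```python
class DialectASRServer:
    """Sketch of steps 61-64: identify the wake word's dialect, select that
    dialect's ASR model, then recognize subsequent speech with it."""

    def __init__(self, asr_models, identify_dialect):
        self.asr_models = asr_models          # dialect name -> recognizer callable
        self.identify_dialect = identify_dialect
        self.selected_model = None

    def on_wake_word(self, wake_word_audio):
        # Steps 61-62: receive the wake word and identify its dialect.
        dialect = self.identify_dialect(wake_word_audio)
        # Step 63: select the ASR model corresponding to the first dialect.
        self.selected_model = self.asr_models[dialect]
        return dialect

    def on_speech(self, speech_audio):
        # Step 64: recognize the speech signal with the selected model.
        if self.selected_model is None:
            raise RuntimeError("no wake word received yet")
        return self.selected_model(speech_audio)
```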
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高多方言语音识别的效率。Further, since the voice wake-up word is relatively short, the process of recognizing the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
在一些示例性实施例中,上述识别语音唤醒词所属的第一方言的一种方式包括:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言。In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
在另一些示例性实施例中,上述识别语音唤醒词所属的第一方言的另一种方式包括:将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言。In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second setting requirement.
在又一些示例性实施例中,上述识别语音唤醒词所属的第一方言的又一种方式包括:将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement.
在一些示例性实施例中,在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,该方法还包括:收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and constructing the ASR model corresponding to each dialect according to its acoustic features.
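The three-stage preparation described above (collect corpora, extract acoustic features, build one ASR model per dialect) can be outlined as below. The "feature extraction" and the "model" here are deliberately toy stand-ins (frame energies and their mean), since the text defers the real construction process to the prior art; an actual system would extract features such as MFCCs and train full acoustic models.

```python
# Toy sketch of the per-dialect training pipeline: corpora in, one "model"
# per dialect out. All function names and data shapes are illustrative.

def extract_features(utterance):
    """Placeholder feature extraction: mean energy per frame.
    `utterance` is assumed to be a list of frames (lists of samples)."""
    return [sum(frame) / len(frame) for frame in utterance]

def build_asr_models(corpora):
    """corpora: dialect name -> list of utterances.
    Returns a dialect -> model mapping (here, just a mean feature value)."""
    models = {}
    for dialect, utterances in corpora.items():
        feats = [f for u in utterances for f in extract_features(u)]
        models[dialect] = sum(feats) / len(feats)
    return models
```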
在一些示例性实施例中,服务器可以向终端设备返回语音识别结果或语音识别结果的关联信息。例如,服务器可以将语音识别出的文本内容返回给终端设备;或者,也可以将与语音识别结果相匹配的歌曲、视频等信息返回给终端设备。In some exemplary embodiments, the server may return the speech recognition result, or information associated with it, to the terminal device. For example, the server may return the recognized text content to the terminal device, or return information such as songs or videos that match the speech recognition result.
在上述各实施例中,由终端设备和服务器配合执行多方言的语音识别,但并不限于此。例如,若终端设备或者服务器的处理功能与存储功能足够强大,则可将多方言语音识别功能单独集成于终端设备或者服务器上实现。基于此,本申请又一示例性实施例提供一种由服务器或终端设备独立实施的语音识别方法。为了描述简便,在下述实施例中,将服务器和终端设备统一称为电子设备。如图7所示,由服务器或终端设备独立实施的语音识别方法包括以下步骤:In the above embodiments, multi-dialect speech recognition is performed by the terminal device and the server in cooperation, but the application is not limited thereto. For example, if the processing and storage capabilities of the terminal device or the server are sufficiently powerful, the multi-dialect speech recognition function can be integrated on the terminal device or the server alone. Based on this, still another exemplary embodiment of the present application provides a speech recognition method implemented independently by a server or a terminal device. For simplicity of description, in the following embodiments, the server and the terminal device are collectively referred to as an electronic device. As shown in FIG. 7, the speech recognition method implemented independently by the server or the terminal device includes the following steps:
71、接收语音唤醒词。71. Receive a speech wake-up word.
72、识别语音唤醒词所属的第一方言。72. Identify a first dialect to which the voice wake-up word belongs.
73、从不同方言对应的ASR模型中选择第一方言对应的ASR模型。73. Select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
74、利用第一方言对应的ASR模型对待识别语音信号进行语音识别。74. Perform speech recognition by using the ASR model corresponding to the first dialect to identify the speech signal.
当用户想要进行语音识别时,可以向电子设备输入语音唤醒词,该语音唤醒词是指定文本内容的语音信号,例如“开启”、“天猫精灵”、“hello”等。电子设备接收用户发送的语音唤醒词,并识别语音唤醒词所属的第一方言。其中,第一方言指语音唤醒词所属的方言,例如官话方言、晋语、湘语等。When the user wants to perform voice recognition, a voice wake-up word may be input to the electronic device, and the voice wake-up word is a voice signal specifying the text content, such as "on", "Tmall Elf", "hello", and the like. The electronic device receives the voice wake-up word sent by the user, and identifies the first dialect to which the voice wake-up word belongs. Among them, the first dialect refers to the dialect to which the awakening words of speech belong, such as Mandarin dialect, Jin dialect, Xiang dialect and so on.
接着,电子设备从不同方言对应的ASR模型中,选择第一方言对应的ASR模型,以便基于第一方言对应的ASR模型对后续待识别语音信号进行语音识别。在本实施例中,电子设备预先存储有不同方言对应的ASR模型。可选地,一种方言对应一个ASR模型,或者几种类似的方言也可以对应同一ASR模型,对此不做限定。其中,第一方言对应的ASR模型用于将第一方言的语音信号转换为文本内容。Next, the electronic device selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so as to perform speech recognition on the subsequent speech signal to be recognized based on that model. In this embodiment, the electronic device stores the ASR models corresponding to different dialects in advance. Optionally, each dialect may correspond to its own ASR model, or several similar dialects may share the same ASR model; this is not limited here. The ASR model corresponding to the first dialect is used to convert a speech signal in the first dialect into text content.
电子设备在选择第一方言对应的ASR模型后,会利用第一方言对应的ASR模型对待识别语音信号进行语音识别。可选地,待识别语音信号可以是用户在输入语音唤醒词后,继续向电子设备输入的语音信号,基于此,电子设备在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,还可以接收用户输入的待识别语音信号。或者,待识别语音信号也可以是预先录制并存储在电子设备本地的语音信号,基于此,电子设备可以直接从本地获取待识别语音信号。After selecting the ASR model corresponding to the first dialect, the electronic device performs speech recognition on the speech signal to be recognized using that model. Optionally, the speech signal to be recognized may be a speech signal that the user continues to input to the electronic device after inputting the voice wake-up word; in that case, before performing speech recognition with the ASR model corresponding to the first dialect, the electronic device may further receive the speech signal to be recognized input by the user. Alternatively, the speech signal to be recognized may be a speech signal pre-recorded and stored locally on the electronic device, in which case the electronic device may obtain it directly from local storage.
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are constructed for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified in advance, the ASR model corresponding to that dialect is selected from the ASR models of the different dialects, and the selected model is used to recognize the subsequent speech signal to be recognized. This automates multi-dialect speech recognition, and the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word without any manual operation by the user, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高多方言语音识别的效率。Further, since the voice wake-up word is relatively short, the process of recognizing the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
在一些示例性实施例中,上述识别语音唤醒词所属的第一方言的一种方式包括:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言。In some exemplary embodiments, one manner of identifying the first dialect to which the voice wake-up word belongs includes: dynamically matching the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking as the first dialect the dialect corresponding to the reference wake-up word whose matching degree with the voice wake-up word meets a first setting requirement.
在另一些示例性实施例中,上述识别语音唤醒词所属的第一方言的另一种方式包括:将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言。In other exemplary embodiments, another manner of identifying the first dialect to which the voice wake-up word belongs includes: matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking as the first dialect the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second setting requirement.
在又一些示例性实施例中,上述识别语音唤醒词所属的第一方言的又一种方式包括:将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In still other exemplary embodiments, yet another manner of identifying the first dialect to which the voice wake-up word belongs includes: converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking as the first dialect the dialect corresponding to the reference text wake-up word whose matching degree with the text wake-up word meets a third setting requirement.
在一些示例性实施例中,上述接收语音唤醒词,包括:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In some exemplary embodiments, the receiving the voice wake-up word includes: presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquiring a voice wake-up word input by the user based on the voice input interface.
在一些示例性实施例中,在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,该方法还包括:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In some exemplary embodiments, before performing speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect, the method further includes: outputting voice input prompt information to prompt the user to perform voice input; and receiving the speech signal to be recognized input by the user.
在一些示例性实施例中,在从不同方言对应的ASR模型中选择第一方言对应的ASR模型之前,该方法还包括:收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In some exemplary embodiments, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes: collecting corpora of different dialects; performing feature extraction on the corpora to obtain the acoustic features of the different dialects; and constructing the ASR model corresponding to each dialect according to its acoustic features.
在一些示例性实施例中,在基于第一方言对应的ASR模型对待识别语音信号进行语音识别之后,电子设备可以基于语音识别结果或语音识别结果的关联信息执行后续处理。In some exemplary embodiments, after performing speech recognition on the speech signal to be recognized based on the ASR model corresponding to the first dialect, the electronic device may perform subsequent processing based on the speech recognition result or the association information of the speech recognition result.
值得说明的是,在本申请上述实施例或下述实施例中,语音唤醒词可以是预置的;或者,也可以允许用户自定义唤醒词。这里自定义唤醒词或预置唤醒词主要是指唤醒词的内容和/或声调等。其中,自定义语音唤醒词的功能可由终端设备来实现,也可以由服务器来实现。可选地,可由识别语音唤醒词所属方言的设备提供自定义语音唤醒词的功能。It should be noted that, in the above embodiment or the following embodiments of the present application, the voice wake-up word may be preset; or, the user may be allowed to customize the wake-up word. Here, the custom wake-up word or the preset wake-up word mainly refers to the content and/or tone of the wake-up word. The function of the custom voice wake-up word can be implemented by the terminal device or by the server. Alternatively, the function of the custom speech wake-up word may be provided by a device that recognizes the dialect to which the speech wake-up word belongs.
以终端设备提供自定义唤醒词的功能为例,终端设备可以向用户提供一种自定义唤醒词的入口。该入口可以实现为一物理按钮,基于此,用户可以点击该物理按钮触发唤醒词自定义操作。或者,该入口可以是终端设备的设置选项中的唤醒词自定义子项,基于此,用户可以进入终端设备的设置选项,然后针对该唤醒词自定义子项进行点击、悬停或长按等操作,从而触发唤醒词自定义操作。无论用户通过何种方式触发唤醒词自定义操作,对终端设备来说,可响应于唤醒词自定义操作,接收用户输入的自定义语音信号,并将接收到的自定义语音信号保存为语音唤醒词。可选地,终端设备可以向用户展示一音频录入页面,以录制用户发出的自定义语音信号。例如,用户在触发唤醒词自定义操作后,终端设备向用户展示音频录入页面,此时,用户可以输入语音信号“你好”,则终端设备接收到语音信号“你好”后会将语音信号“你好”设置为语音唤醒词。可选地,终端设备可以维护一唤醒词库,将用户自定义的语音唤醒词保存至唤醒词库中。Taking the case where the terminal device provides the custom wake-up word function as an example, the terminal device may provide the user with an entry for customizing the wake-up word. The entry may be implemented as a physical button, in which case the user can press the button to trigger the wake-up word customization operation. Alternatively, the entry may be a wake-up word customization sub-item in the settings of the terminal device, in which case the user can open the settings and click, hover over, or long-press that sub-item to trigger the operation. Regardless of how the user triggers the wake-up word customization operation, the terminal device can, in response to it, receive the custom voice signal input by the user and save it as the voice wake-up word. Optionally, the terminal device may present an audio recording page to the user to record the custom voice signal uttered by the user. For example, after the user triggers the wake-up word customization operation, the terminal device presents the audio recording page; the user may then input the voice signal "你好", and upon receiving it the terminal device sets "你好" as the voice wake-up word. Optionally, the terminal device may maintain a wake-up word library and save the user-defined voice wake-up word into it.
可选地,语音唤醒词不宜过长,以降低识别所属方言时的难度,但也不宜过短。语音唤醒词过短,辨识度不高,容易造成误唤醒。例如,语音唤醒词可以在3至5个字符之间,但不限于此。这里的1个字符是指1个汉字,也可以是1个英文字母。Optionally, the voice wake-up word should not be too long, so as to reduce the difficulty of identifying its dialect, but it should not be too short either: a wake-up word that is too short is not distinctive enough and easily causes false wake-ups. For example, the voice wake-up word may be between 3 and 5 characters, but is not limited thereto. One character here means one Chinese character or one English letter.
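The length constraint suggested above (3 to 5 characters, where one character is one Chinese character or one English letter) could be checked as follows; the function name and the bounds-as-defaults are assumptions for illustration.

```python
# Hypothetical validity check for a user-defined wake word, using the
# 3-to-5-character guideline from the text. Python's len() counts each
# Chinese character and each letter as one character.

def is_valid_wake_word(word, min_len=3, max_len=5):
    return min_len <= len(word) <= max_len
```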
可选地,在自定义唤醒词时,可以选择易于区分的词,而不宜选用较为常用的词,以降低应用被误唤醒的几率。Optionally, when customizing the wake-up words, you can select words that are easy to distinguish, and you should not use more common words to reduce the chance that the application will be awakened by mistake.
在本申请另一些实施例中,语音唤醒词主要用于唤醒或激活应用的语音识别功能,可以不限定语音唤醒词所属的方言,即用户可以采用任意方言或普通话来发出语音唤醒词。用户在发出语音唤醒词之后,可以再发出一具有方言指示意义的语音信号,例如该语音信号可以是内容为“天津话”、“河南话”、“启用闽南方言”等的语音信号。然后,可从用户发出的具有方言指示意义的语音信号中解析出需要进行语音识别的方言,进而从不同方言对应的ASR模型中选择与所解析出的方言对应的ASR模型,并基于所选择的ASR模型对后续待识别语音信号进行语音识别。为便于区分和描述,将这里具有方言指示意义的语音信号称为第一语音信号,将从所述第一语音信号中解析出的方言称为第一方言。In other embodiments of the present application, the voice wake-up word is mainly used to wake up or activate the speech recognition function of the application, and the dialect of the voice wake-up word need not be limited; that is, the user may utter the voice wake-up word in any dialect or in Mandarin. After uttering the voice wake-up word, the user may utter a voice signal with dialect-indicating meaning, for example a voice signal whose content is "天津话" (Tianjin dialect), "河南话" (Henan dialect), "启用闽南方言" (enable the Minnan dialect), or the like. The dialect in which speech recognition is required can then be parsed from this voice signal, the ASR model corresponding to the parsed dialect is selected from the ASR models of different dialects, and speech recognition is performed on the subsequent speech signal to be recognized based on the selected model. For ease of distinction and description, the voice signal with dialect-indicating meaning is referred to as the first voice signal, and the dialect parsed from the first voice signal is referred to as the first dialect.
其中,凡是具有方言指示意义的语音信号均可以作为本申请实施例中的第一语音信号。例如,第一语音信号可以是用户以第一方言发出的语音信号,从而可基于第一语音信号的声学特征识别第一方言。或者,第一语音信号可以是包含第一方言的名称的语音信号,例如在语音信号“请启用闽南话模型”中,“闽南话”即为第一方言的名称。基于此,可以从第一语音信号中提取第一方言的名称对应的音素片段,进而识别出第一方言。Any voice signal with dialect-indicating meaning can serve as the first voice signal in the embodiments of the present application. For example, the first voice signal may be a voice signal uttered by the user in the first dialect, so that the first dialect can be identified based on the acoustic features of the first voice signal. Alternatively, the first voice signal may be a voice signal containing the name of the first dialect; for example, in the voice signal "请启用闽南话模型", "闽南话" (Minnan dialect) is the name of the first dialect. Based on this, the phoneme segment corresponding to the name of the first dialect can be extracted from the first voice signal, and the first dialect thereby identified.
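The second option above, recognizing a dialect name contained in the first voice signal, can be sketched as a lookup over the recognized text of that signal. The dialect-name table and its entries are illustrative assumptions; a real system would match at the phoneme level as the text describes.

```python
# Hypothetical dialect-name lookup over the recognized text of the first
# voice signal, e.g. "请启用闽南话模型" names the Minnan dialect.

DIALECT_NAMES = {
    "闽南话": "minnan",
    "天津话": "tianjin",
    "河南话": "henan",
}

def parse_dialect(text):
    """Return the dialect named in the utterance text, or None if no
    known dialect name occurs in it."""
    for name, dialect in DIALECT_NAMES.items():
        if name in text:
            return dialect
    return None
```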
上述结合语音唤醒词和第一语音信号的语音识别方法可由终端设备和服务器相互配合实施,也可以由终端设备或服务器独立实施。下面将针对不同实施方式分别进行说明:The above speech recognition method combining the voice wake-up word and the first voice signal may be implemented by the terminal device and the server in cooperation, or independently by the terminal device or the server. The different implementations are described separately below:
方式A:上述结合语音唤醒词和第一语音信号的语音识别方法由终端设备和服务器相互配合实施。在方式A中,终端设备支持语音唤醒功能,当用户想要进行语音识别时,可以向终端设备输入语音唤醒词,以唤醒语音识别功能。终端设备接收语音唤醒词,以唤醒语音识别功能。然后,用户向终端设备输入具有方言指示意义的第一语音信号;终端设备接收用户输入的第一语音信号后,从第一语音信号中解析出需要进行语音识别的第一方言,即后续待识别语音信号所属的方言,从而为采用相应方言对应的ASR模型进行语音识别提供基础。Mode A: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server in cooperation. In mode A, the terminal device supports the voice wake-up function; when the user wants to perform speech recognition, a voice wake-up word can be input to the terminal device to wake up the speech recognition function. The terminal device receives the voice wake-up word and wakes up the speech recognition function. The user then inputs to the terminal device a first voice signal with dialect-indicating meaning; after receiving it, the terminal device parses from the first voice signal the first dialect in which speech recognition is required, i.e., the dialect of the subsequent speech signal to be recognized, thereby providing a basis for speech recognition with the ASR model of the corresponding dialect.
终端设备在从第一语音信号中解析出第一方言后,向服务器发送服务请求,该服务请求指示服务器从不同方言对应的ASR模型中选择第一方言对应的ASR模型。服务器接收终端设备发送的服务请求之后,根据该服务请求的指示从不同方言对应的ASR模型中,选择第一方言对应的ASR模型,以便基于第一方言对应的ASR模型对后续待识别语音信号进行语音识别。After parsing the first dialect from the first voice signal, the terminal device sends a service request to the server, where the service request instructs the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects. After receiving the service request sent by the terminal device, the server selects the ASR model corresponding to the first dialect from the ASR model corresponding to the different dialects according to the indication of the service request, so as to perform the subsequent to-be-identified voice signal based on the ASR model corresponding to the first dialect. Speech Recognition.
终端设备在向服务器发送服务请求后,继续向服务器发送待识别语音信号,该待识别语音信号属于第一方言。服务器接收终端设备发送的待识别语音信号,并根据选择的第一方言对应的ASR模型对待识别语音信号进行语音识别。对待识别语音信号而言,采用与之匹配的ASR模型进行语音识别,有利于提高语音识别的准确性。After sending the service request to the server, the terminal device continues to send the speech signal to be recognized, which belongs to the first dialect. The server receives it and performs speech recognition on it according to the selected ASR model corresponding to the first dialect. For the speech signal to be recognized, performing recognition with a matching ASR model helps improve the accuracy of speech recognition.
Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; in that case, the terminal device may further receive the user's to-be-recognized voice signal before sending it to the server. Alternatively, the to-be-recognized voice signal may be a voice signal that was pre-recorded and stored locally on the terminal device.
In some exemplary embodiments, the voice wake-up word is mainly used to wake up the speech recognition function of the terminal device, while the first dialect in which subsequent speech recognition is required is provided by the first voice signal. On this basis, the language in which the user utters the voice wake-up word need not be restricted. For example, the user may utter the voice wake-up word in Mandarin, in the first dialect, or in a dialect other than the first dialect.
However, the same user may well use the same language when issuing voice signals to the terminal device; that is, the user may input the voice wake-up word and the first voice signal in the same dialect. For such application scenarios, after receiving the first voice signal input by the user, the terminal device may preferentially parse the first dialect from the first voice signal; if the first dialect cannot be parsed from the first voice signal, the terminal device may identify the dialect to which the voice wake-up word belongs and take it as the first dialect. The implementation for identifying the dialect to which the voice wake-up word belongs is the same as in the foregoing embodiments and is not repeated here.
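The preferential-parse-then-fallback logic just described can be sketched as follows (a minimal illustration; `parse_dialect_from_speech` and `identify_wake_word_dialect` are hypothetical helpers injected by the caller):

```python
def determine_first_dialect(first_signal, wake_word,
                            parse_dialect_from_speech,
                            identify_wake_word_dialect):
    """Prefer the dialect named in the first voice signal; if parsing
    fails, fall back to the dialect in which the wake-up word was spoken."""
    dialect = parse_dialect_from_speech(first_signal)
    if dialect is None:  # parsing failed
        dialect = identify_wake_word_dialect(wake_word)
    return dialect
```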
Mode B: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device and the server in cooperation. In Mode B, the terminal device is mainly configured to receive the voice wake-up word and the first voice signal input by the user and report them to the server, so that the server parses the first dialect from the first voice signal; in this respect it differs from the terminal device in Mode A. Correspondingly, in addition to providing ASR models for different dialects and selecting the corresponding ASR model to recognize voice signals in the corresponding dialect, the server also has the function of parsing the first dialect from the first voice signal.
In Mode B, when the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device. The terminal device receives the voice wake-up word input by the user and sends it to the server. The server wakes up its own speech recognition function based on the voice wake-up word. After inputting the voice wake-up word, the user may continue by inputting the first voice signal to the terminal device, which sends the received first voice signal to the server. The server parses the first dialect from the first voice signal and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, so that voice signals in the first dialect can subsequently be recognized based on that model.
After sending the first voice signal to the server, the terminal device continues by sending the to-be-recognized voice signal to the server. Having selected the ASR model corresponding to the first dialect, the server uses that model to perform speech recognition on the to-be-recognized voice signal. Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device after inputting the first voice signal; in that case, the terminal device may further receive the user's to-be-recognized voice signal before sending it to the server. Alternatively, the to-be-recognized voice signal may be a voice signal that was pre-recorded and stored locally on the terminal device.
In some exemplary embodiments, before the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs and taking it as the first dialect.
In some exemplary embodiments, when the server parses from the first voice signal the first dialect in which speech recognition is required, the parsing includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Mode C: the above speech recognition method combining the voice wake-up word and the first voice signal is implemented by the terminal device or the server alone. In Mode C, when the user wants to perform speech recognition, the user may input a voice wake-up word to the terminal device or the server, which wakes up the speech recognition function according to the voice wake-up word input by the user. After inputting the voice wake-up word, the user may continue by inputting to the terminal device or the server a first voice signal that indicates the dialect. The terminal device or the server parses the first dialect from the first voice signal and selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
After selecting the ASR model corresponding to the first dialect, the terminal device or the server uses that model to perform speech recognition on the to-be-recognized voice signal. Optionally, the to-be-recognized voice signal may be a voice signal that the user continues to input to the terminal device or the server after inputting the first voice signal; in that case, the terminal device or the server may further receive the user's to-be-recognized voice signal before performing speech recognition on it with the ASR model corresponding to the first dialect. Alternatively, the to-be-recognized voice signal may be a voice signal that was pre-recorded and stored locally on the terminal device or the server, in which case the terminal device or the server may obtain it directly from local storage.
In some exemplary embodiments, before the terminal device or the server selects the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the method further includes: if the first dialect cannot be parsed from the first voice signal, identifying the dialect to which the voice wake-up word belongs and taking it as the first dialect.
In some exemplary embodiments, when the terminal device or the server parses from the first voice signal the first dialect in which speech recognition is required, the parsing includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to the different dialect names stored in the memory against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Optionally, in Modes A, B, and C above, parsing from the first voice signal the first dialect in which speech recognition is required includes: converting the first voice signal into a first phoneme sequence based on an acoustic model; matching the phoneme segments corresponding to different dialect names against the first phoneme sequence; and, when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
Before the first voice signal is converted into the first phoneme sequence based on the acoustic model, the first voice signal needs to be preprocessed and its features extracted. Preprocessing includes pre-emphasis, windowing and framing, and endpoint detection. Feature extraction then extracts acoustic features, such as time-domain or frequency-domain features, from the preprocessed first voice signal.
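The first two preprocessing steps can be sketched as follows (a minimal pure-Python illustration under typical assumptions — a pre-emphasis coefficient of 0.97 and fixed-length overlapping frames; endpoint detection and the actual feature computation are omitted):

```python
def pre_emphasis(signal, alpha=0.97):
    """Pre-emphasis: boost high frequencies with y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame(signal, frame_len, hop):
    """Windowing/framing: split the signal into overlapping frames of
    frame_len samples, advancing by hop samples each time."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]
```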
The acoustic model converts the acoustic features of the first voice signal into a phoneme sequence. Phonemes are the basic units that make up the pronunciation of a word or a Chinese character. The phonemes constituting the pronunciation of an English word may be the 39 phonemes defined by Carnegie Mellon University; the phonemes constituting the pronunciation of Chinese characters may be the full set of initials and finals. Acoustic models include, but are not limited to, neural-network-based deep learning models and hidden Markov models. Converting acoustic features into a phoneme sequence belongs to the prior art and is not described further here.
After converting the first voice signal into the first phoneme sequence, the terminal device or the server matches the phoneme segments corresponding to different dialect names against the first phoneme sequence. Phoneme segments for different dialect names may be stored in advance, for example the phoneme segment of the dialect name "河南话" (Henan dialect), that of the dialect name "闽南语" (Minnan), that of the dialect name "British English", and so on. If a dialect name is an English word, its phoneme segment is a segment composed of phonemes drawn from the 39 phonemes defined by Carnegie Mellon University; if a dialect name is written in Chinese characters, its phoneme segment is a segment composed of the initials and finals of the name. The first phoneme sequence is compared with the pre-stored phoneme segments corresponding to the different dialect names to determine whether the first phoneme sequence contains a phoneme segment that is identical or similar to the phoneme segment of some dialect name. Optionally, the similarity between each phoneme segment in the first phoneme sequence and the phoneme segments of the different dialect names may be computed; among the phoneme segments of the different dialect names, the segment whose similarity to some phoneme segment in the first phoneme sequence meets a preset similarity requirement is selected as the matched phoneme segment. The dialect corresponding to the matched phoneme segment is then taken as the first dialect.
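The matching step can be sketched as follows. This is a minimal illustration under simplifying assumptions: `DIALECT_PHONEMES` holds made-up phoneme segments for two dialect names, and similarity is just the fraction of positions at which two equal-length segments agree; a real system would use stored pronunciations and a proper similarity measure.

```python
DIALECT_PHONEMES = {  # hypothetical pre-stored phoneme segments per dialect name
    "Henan dialect": ["h", "e", "n", "an", "h", "ua"],
    "Minnan": ["m", "in", "n", "an", "y", "u"],
}


def segment_similarity(a, b):
    """Fraction of positions at which two segments agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def match_dialect(phoneme_seq, threshold=0.8):
    """Slide each dialect-name phoneme segment over the first phoneme
    sequence; return the dialect whose segment meets the preset
    similarity requirement, or None if no dialect name is found."""
    for dialect, segment in DIALECT_PHONEMES.items():
        n = len(segment)
        for i in range(len(phoneme_seq) - n + 1):
            if segment_similarity(phoneme_seq[i:i + n], segment) >= threshold:
                return dialect
    return None
```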
It should be noted that some steps or content in the foregoing Modes A, B, and C are the same as or similar to steps or content in the embodiments shown in FIGS. 1-7; for such content, reference may be made to the descriptions of the embodiments shown in FIGS. 1-7, which are not repeated here.
In addition, some of the flows described in the foregoing embodiments and drawings include multiple operations that appear in a specific order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or in parallel. Operation numbers such as 201 and 202 are merely used to distinguish different operations; the numbers themselves do not represent any execution order. Moreover, these flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they neither represent an order nor require that the "first" and "second" items be of different types.
FIG. 8 is a schematic structural diagram of the modules of a speech recognition apparatus provided by still another exemplary embodiment of the present application. As shown in FIG. 8, the speech recognition apparatus 800 includes a receiving module 801, an identification module 802, a first sending module 803, and a second sending module 804.
The receiving module 801 is configured to receive a voice wake-up word.
The identification module 802 is configured to identify the first dialect to which the voice wake-up word received by the receiving module 801 belongs.
The first sending module 803 is configured to send a service request to the server to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
The second sending module 804 is configured to send the to-be-recognized voice signal to the server, so that the server performs speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.
In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the identification module 802 is specifically configured to: dynamically match the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets a first setting requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose degree of match with the acoustic features of the voice wake-up word meets a second setting requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets a third setting requirement.
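The first option — dynamic matching of acoustic features against per-dialect reference wake-up words — is the kind of comparison that dynamic time warping (DTW) performs. The sketch below is a minimal illustration of that idea, not the patent's implementation: features are simplified to 1-D number sequences, and the reference recordings in `best_dialect` are supplied by the caller.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences;
    a lower distance means a better match against the reference."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]


def best_dialect(wake_features, references):
    """Pick the dialect whose recorded reference wake-up word is closest
    to the input wake-up word under DTW."""
    return min(references,
               key=lambda dia: dtw_distance(wake_features, references[dia]))
```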
In an optional implementation, when receiving the voice wake-up word, the receiving module 801 is specifically configured to: in response to an instruction to activate or power on the terminal device, present a voice input interface to the user; and obtain the voice wake-up word input by the user via the voice input interface.
In an optional implementation, before sending the to-be-recognized voice signal to the server, the second sending module 804 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
In an optional implementation, before outputting the voice input prompt information, the second sending module 804 is further configured to: receive a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional implementation, before receiving the voice wake-up word, the receiving module 801 is further configured to: in response to a wake-up word customization operation, receive a custom voice signal input by the user; and save the custom voice signal as the voice wake-up word. The internal functions and structure of the speech recognition apparatus 800 are described above. As shown in FIG. 9, in practice, the speech recognition apparatus 800 may be implemented as a terminal device including a memory 901, a processor 902, and a communication component 903.
The memory 901 is configured to store a computer program and may also be configured to store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
The memory 901 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 902 is coupled to the memory 901 and is configured to execute the computer program in the memory 901 to: receive a voice wake-up word through the communication component 903; identify the first dialect to which the voice wake-up word belongs; send a service request to the server through the communication component 903 to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server through the communication component 903, so that the server performs speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect.
The communication component 903 is configured to receive the voice wake-up word, and to send the service request and the to-be-recognized voice signal to the server.
In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the processor 902 is specifically configured to:
dynamically match the acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and take as the first dialect the dialect corresponding to the reference wake-up word whose degree of match with the voice wake-up word meets a first setting requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take as the first dialect the dialect whose degree of match with the acoustic features of the voice wake-up word meets a second setting requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take as the first dialect the dialect corresponding to the reference text wake-up word whose degree of match with the text wake-up word meets a third setting requirement.
In an optional implementation, as shown in FIG. 9, the terminal device further includes a display screen 904. On this basis, when receiving the voice wake-up word, the processor 902 is specifically configured to: in response to an instruction to activate or power on the terminal device, present a voice input interface to the user through the display screen 904; and obtain the voice wake-up word input by the user via the voice input interface.
In an optional implementation, the terminal device further includes an audio component 906. On this basis, before sending the to-be-recognized voice signal to the server, the processor 902 is further configured to: output voice input prompt information through the audio component 906 to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user through the audio component 906. Correspondingly, the audio component 906 is further configured to output the voice input prompt information and to receive the to-be-recognized voice signal input by the user.
In an optional implementation, before outputting the voice input prompt information, the processor 902 is further configured to: receive, through the communication component 903, a notification message returned by the server, the notification message indicating that the ASR model corresponding to the first dialect has been selected.
In an optional implementation, before receiving the voice wake-up word, the processor 902 is further configured to: in response to a wake-up word customization operation, receive, through the communication component 903, a custom voice signal input by the user; and save the custom voice signal as the voice wake-up word.
Further, as shown in FIG. 9, the terminal device also includes other components such as a power component 905.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed, the steps that can be performed by the terminal device in the foregoing method embodiments can be implemented.
FIG. 10 is a schematic structural diagram of the modules of another speech recognition apparatus provided by still another exemplary embodiment of the present application. As shown in FIG. 10, the speech recognition apparatus 1000 includes a first receiving module 1001, a selection module 1002, a second receiving module 1003, and an identification module 1004.
The first receiving module 1001 is configured to receive a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect.
The selection module 1002 is configured to select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs.
The second receiving module 1003 is configured to receive the to-be-recognized voice signal sent by the terminal device.
The identification module 1004 is configured to perform speech recognition on the to-be-recognized voice signal received by the second receiving module 1003 using the ASR model corresponding to the first dialect.
In an optional implementation, the speech recognition apparatus 1000 further includes a construction module configured to, before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects: collect corpora of the different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and construct the ASR models corresponding to the different dialects according to the acoustic features of the different dialects.
The internal functions and structure of the speech recognition apparatus 1000 are described above. As shown in FIG. 11, in practice, the speech recognition apparatus 1000 may be implemented as a server including a memory 1101, a processor 1102, and a communication component 1103.
The memory 1101 is configured to store a computer program and may also be configured to store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
The memory 1101 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The processor 1102 is coupled to the memory 1101 and is configured to execute the computer program in the memory 1101 to: receive, through the communication component 1103, a service request sent by the terminal device, the service request instructing selection of the ASR model corresponding to the first dialect; select, from the ASR models corresponding to different dialects, the ASR model corresponding to the first dialect, the first dialect being the dialect to which the voice wake-up word belongs; and receive, through the communication component 1103, the to-be-recognized voice signal sent by the terminal device and perform speech recognition on it using the ASR model corresponding to the first dialect.
The communication component 1103 is configured to receive the service request and the to-be-recognized voice signal.
在一可选实施方式中,处理器1102在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,还用于:收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, the processor 1102 is configured to: collect corpora of different dialects before extracting the ASR model corresponding to the first dialect in the ASR model corresponding to different dialects; perform feature extraction on corpora of different dialects, In order to obtain the acoustic characteristics of different dialects; according to the acoustic characteristics of different dialects, construct ASR models corresponding to different dialects.
进一步,如图11所示,该服务器还包括:音频组件1106。基于此,处理器1102还用于:通过音频组件1106接收终端设备发送的待识别语音信号。Further, as shown in FIG. 11, the server further includes an audio component 1106. Based on this, the processor 1102 is further configured to: receive, by the audio component 1106, the to-be-identified voice signal sent by the terminal device.
可选地,如图11所示,该服务器还包括显示屏1104、电源组件1105等其它组件。Optionally, as shown in FIG. 11, the server further includes a display 1104, a power component 1105, and the like.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由服务器执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高对多方言语音进行识别的效率。Further, since the voice wake-up word is relatively short, the process of identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
图12为本申请又一示例性实施例提供的又一种语音识别装置的模块结构示意图。如图12所示,语音识别装置1200包括接收模块1201、第一发送模块1202和第二发送模块1203。FIG. 12 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application. As shown in FIG. 12, the voice recognition apparatus 1200 includes a receiving module 1201, a first transmitting module 1202, and a second transmitting module 1203.
接收模块1201,用于接收语音唤醒词。The receiving module 1201 is configured to receive a voice wake-up word.
第一发送模块1202,用于向服务器发送接收模块1201接收的语音唤醒词,以供服务器基于语音唤醒词从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型。The first sending module 1202 is configured to send, to the server, the voice wake-up words received by the receiving module 1201, so that the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects based on the voice wake-up words.
第二发送模块1203,用于向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The second sending module 1203 is configured to send the to-be-identified voice signal to the server, so that the server performs voice recognition on the voice signal to be recognized by using the ASR model corresponding to the first dialect.
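The two-step exchange performed by modules 1201-1203 (first send the wake-up word so the server can pick the dialect-specific ASR model, then send the speech to be recognized) can be sketched with a local stub server. All class and method names below are illustrative assumptions, not the patent's actual interfaces.

```python
# Illustrative sketch of the terminal/server exchange. The server is a local
# stub: handle_wake_word() plays the role of dialect identification plus ASR
# model selection, handle_speech() plays the role of recognition.
class StubServer:
    def __init__(self, models):
        self.models = models          # dialect -> recognizer callable
        self.selected = None

    def handle_wake_word(self, wake_word):
        # Pretend to identify the dialect from the wake-up word (assumed words).
        dialect = "cantonese" if wake_word == "lei-hou" else "mandarin"
        self.selected = self.models[dialect]
        return dialect                # doubles as the "model selected" notice

    def handle_speech(self, speech):
        assert self.selected is not None, "wake-up word must be sent first"
        return self.selected(speech)

class Terminal:
    def __init__(self, server):
        self.server = server

    def recognize(self, wake_word, speech):
        self.server.handle_wake_word(wake_word)   # first sending module 1202
        return self.server.handle_speech(speech)  # second sending module 1203
```

The key design point mirrored here is the ordering constraint: the recognition request is only valid after the wake-up word has caused a model to be selected.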
在一可选实施方式中,接收模块1201在接收语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, when receiving the voice wake-up word, the receiving module 1201 is specifically configured to: present a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,第二发送模块1203在向服务器发送待识别语音信号之前,还用于:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In an optional implementation, before sending the to-be-recognized voice signal to the server, the second sending module 1203 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
在一可选实施方式中,第二发送模块1203在输出语音输入提示信息之前,还用于:接收服务器返回的通知消息,通知消息用于指示已选择第一方言对应的ASR模型。In an optional implementation manner, before outputting the voice input prompt information, the second sending module 1203 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
在一可选实施方式中,接收模块1201在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,接收用户输入的自定义语音信号。第一发送模块1202还用于将自定义语音信号上传至服务器。In an optional implementation, before receiving the voice wake-up word, the receiving module 1201 is further configured to: receive a customized voice signal input by the user in response to the wake-up word customization operation. The first sending module 1202 is further configured to upload a customized voice signal to the server.
以上描述了语音识别装置1200的内部功能和结构,如图13所示,实际中,该语音识别装置1200可实现为一种终端设备,包括:存储器1301、处理器1302以及通信组件1303。The internal function and structure of the speech recognition apparatus 1200 are described above. As shown in FIG. 13, in actuality, the speech recognition apparatus 1200 can be implemented as a terminal device, including: a memory 1301, a processor 1302, and a communication component 1303.
存储器1301,用于存储计算机程序,并可被存储为存储其它各种数据以支持在终端设备上的操作。这些数据的示例包括用于在终端设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 1301 is configured to store a computer program, and may further store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
存储器1301可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 1301 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器1302,与存储器1301耦合,用于执行存储器1301中的计算机程序,以用于:通过通信组件1303接收语音唤醒词;通过通信组件1303向服务器发送语音唤醒词,以供服务器基于语音唤醒词从不同方言对应的ASR模型中选择语音唤醒词所属第一方言对应的ASR模型;通过通信组件1303向服务器发送待识别语音信号,以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor 1302 is coupled to the memory 1301 and is configured to execute the computer program in the memory 1301 so as to: receive a voice wake-up word via the communication component 1303; send the voice wake-up word to the server via the communication component 1303, so that the server selects, based on the voice wake-up word, the ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects; and send the to-be-recognized voice signal to the server via the communication component 1303, so that the server performs speech recognition on it using the ASR model corresponding to the first dialect.
通信组件1303,用于接收所述语音唤醒词,向所述服务器发送所述语音唤醒词和所述待识别语音信号。The communication component 1303 is configured to receive the voice wake-up word, and to send the voice wake-up word and the to-be-recognized voice signal to the server.
在一可选实施方式中,如图13所示,该终端设备还包括显示屏1304。基于此,处理器1302在接收语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,通过显示屏1304向用户展示语音输入界面;并基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, as shown in FIG. 13, the terminal device further includes a display screen 1304. On this basis, when receiving the voice wake-up word, the processor 1302 is specifically configured to: present a voice input interface to the user through the display screen 1304 in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,如图13所示,该终端设备还包括音频组件1306。基于此,处理器1302用于:通过音频组件1306接收语音唤醒词。相应地,处理器1302在向服务器发送待识别语音信号之前,还用于:通过音频组件1306输出语音输入提示信息,以提示用户进行语音输入;以及接收用户输入的待识别语音信号。In an alternative embodiment, as shown in FIG. 13, the terminal device further includes an audio component 1306. Based on this, the processor 1302 is configured to receive the speech wake-up words through the audio component 1306. Correspondingly, before sending the to-be-identified voice signal to the server, the processor 1302 is further configured to: output the voice input prompt information through the audio component 1306 to prompt the user to perform voice input; and receive the voice signal to be recognized input by the user.
在一可选实施方式中,处理器1302在输出语音输入提示信息之前,还用于:接收服务器返回的通知消息,通知消息用于指示已选择第一方言对应的ASR模型。In an optional implementation, before outputting the voice input prompt information, the processor 1302 is further configured to: receive a notification message returned by the server, where the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
在一可选实施方式中,处理器1302在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,通过通信组件1303接收用户输入的自定义语音信号,并将自定义语音信号上传至服务器。In an optional implementation, before receiving the voice wake-up word, the processor 1302 is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user via the communication component 1303, and upload the customized voice signal to the server.
进一步,如图13所示,该终端设备还包括:电源组件1305等其它组件。Further, as shown in FIG. 13, the terminal device further includes: a power component 1305 and other components.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由终端设备执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the terminal device in the foregoing method embodiment can be implemented.
图14为本申请又一示例性实施例提供的又一种语音识别装置的模块结构示意图。如图14所示,语音识别装置1400包括第一接收模块1401、第一识别模块1402、选择模块 1403、第二接收模块1404、第二识别模块1405。FIG. 14 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application. As shown in FIG. 14, the voice recognition apparatus 1400 includes a first receiving module 1401, a first identifying module 1402, a selecting module 1403, a second receiving module 1404, and a second identifying module 1405.
第一接收模块1401,用于接收终端设备发送的语音唤醒词。The first receiving module 1401 is configured to receive a voice wake-up word sent by the terminal device.
第一识别模块1402,用于识别语音唤醒词所属的第一方言。The first identification module 1402 is configured to identify a first dialect to which the voice wake-up word belongs.
选择模块1403,用于从不同方言对应的ASR模型中,选择第一方言对应的ASR模型。The selecting module 1403 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
第二接收模块1404,用于接收终端设备发送的待识别语音信号。The second receiving module 1404 is configured to receive a to-be-identified voice signal sent by the terminal device.
第二识别模块1405,用于利用第一方言对应的ASR模型对第二接收模块1404接收的待识别语音信号进行语音识别。The second identification module 1405 is configured to perform voice recognition on the to-be-identified voice signal received by the second receiving module 1404 by using the ASR model corresponding to the first dialect.
在一可选实施方式中,第一识别模块1402在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the first identification module 1402 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
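A toy version of the first matching strategy above, dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in each dialect, can be sketched with classic dynamic time warping (DTW). Feature sequences are plain lists of floats here; a real system would compare frame-level acoustic features, and the "set requirement" threshold logic is reduced to simply picking the closest reference.

```python
# Dynamic matching of a wake-up word against per-dialect reference wake-up
# words via DTW. Feature values and dialect names are illustrative only.
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def identify_dialect(wake_features, references):
    """references: dict dialect -> reference wake-word feature sequence.
    Returns the dialect whose reference matches the wake-up word best."""
    return min(references, key=lambda dia: dtw_distance(wake_features, references[dia]))
```

DTW tolerates the speaking-rate differences that make naive frame-by-frame comparison fragile, which is why "dynamic matching" is a natural fit for short wake-up words.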
在一可选实施方式中,语音识别装置1400还包括构建模块,用于在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, the speech recognition apparatus 1400 further includes a building module, configured to collect corpora of different dialects before selecting an ASR model corresponding to the first dialect in the ASR model corresponding to different dialects; Feature extraction is performed to obtain acoustic features of different dialects; according to the acoustic characteristics of different dialects, ASR models corresponding to different dialects are constructed.
以上描述了语音识别装置1400的内部功能和结构,如图15所示,实际中,该语音识别装置1400可实现为一种服务器,包括:存储器1501、处理器1502以及通信组件1503。The internal function and structure of the speech recognition apparatus 1400 are described above. As shown in FIG. 15, in actuality, the speech recognition apparatus 1400 can be implemented as a server including: a memory 1501, a processor 1502, and a communication component 1503.
存储器1501,用于存储计算机程序,并可被存储为存储其它各种数据以支持在服务器上的操作。这些数据的示例包括用于在服务器上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 1501 is for storing a computer program and can be stored to store other various data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
存储器1501可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 1501 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器1502,与存储器1501耦合,用于执行存储器1501中的计算机程序,以用于:通过通信组件1503接收终端设备发送的语音唤醒词;识别语音唤醒词所属的第一方言;从不同方言对应的ASR模型中,选择第一方言对应的ASR模型;通过通信组件1503接收终端设备发送的待识别语音信号,并利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor 1502 is coupled to the memory 1501 and is configured to execute the computer program in the memory 1501 so as to: receive, via the communication component 1503, the voice wake-up word sent by the terminal device; identify the first dialect to which the voice wake-up word belongs; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and receive, via the communication component 1503, the to-be-recognized voice signal sent by the terminal device and perform speech recognition on it using the ASR model corresponding to the first dialect.
通信组件1503,用于接收语音唤醒词以及待识别语音信号。The communication component 1503 is configured to receive a voice wake-up word and a voice signal to be recognized.
在一可选实施方式中,处理器1502在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the processor 1502 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
在一可选实施方式中,处理器1502在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,还用于收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1502 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and build the ASR models corresponding to the different dialects from those acoustic features.
进一步,如图15所示,该服务器还包括:音频组件1506。基于此,处理器1502用于:通过音频组件1506接收终端设备发送的语音唤醒词,并通过音频组件1506接收所述终端设备发送的待识别语音信号。Further, as shown in FIG. 15, the server further includes an audio component 1506. Based on this, the processor 1502 is configured to: receive, by the audio component 1506, a voice wake-up word sent by the terminal device, and receive, by the audio component 1506, the voice signal to be recognized sent by the terminal device.
进一步,如图15所示,该服务器还包括:显示屏1504、电源组件1505等其它组件。Further, as shown in FIG. 15, the server further includes: a display 1504, a power component 1505, and the like.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由服务器执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the server in the foregoing method embodiment can be implemented.
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高多方言语音识别的效率。Further, since the voice wake-up word is relatively short, the process of identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
图16为本申请又一示例性实施例提供的又一种语音识别装置的模块结构示意图。如图16所示,语音识别装置1600包括接收模块1601、第一识别模块1602、选择模块1603、 第二识别模块1604。FIG. 16 is a schematic structural diagram of a module of a voice recognition apparatus according to still another exemplary embodiment of the present application. As shown in FIG. 16, the voice recognition apparatus 1600 includes a receiving module 1601, a first identifying module 1602, a selecting module 1603, and a second identifying module 1604.
接收模块1601,用于接收语音唤醒词。The receiving module 1601 is configured to receive a voice wake-up word.
第一识别模块1602,用于识别语音唤醒词所属的第一方言。The first identification module 1602 is configured to identify a first dialect to which the voice wake-up word belongs.
选择模块1603,用于从不同方言对应的ASR模型中选择第一方言对应的ASR模型。The selecting module 1603 is configured to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects.
第二识别模块1604,用于利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The second identification module 1604 is configured to perform speech recognition on the speech signal to be recognized by using the ASR model corresponding to the first dialect.
在一可选实施方式中,第一识别模块1602在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the first identification module 1602 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
在一可选实施方式中,接收模块1601在接收终端设备发送的语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,向用户展示语音输入界面;基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, when receiving the voice wake-up word sent by the terminal device, the receiving module 1601 is specifically configured to: present a voice input interface to the user in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,第二识别模块1604在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,还用于:输出语音输入提示信息,以提示用户进行语音输入;接收用户输入的待识别语音信号。In an optional implementation, before performing speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect, the second identification module 1604 is further configured to: output voice input prompt information to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user.
在一可选实施方式中,语音识别装置1600还包括构建模块,用于在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, the voice recognition apparatus 1600 further includes a building module, configured to collect corpora of different dialects before selecting an ASR model corresponding to the first dialect in an ASR model corresponding to different dialects; and corpus for different dialects Feature extraction is performed to obtain acoustic features of different dialects; according to the acoustic characteristics of different dialects, ASR models corresponding to different dialects are constructed.
在一可选实施方式中,接收模块1601在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,接收用户输入的自定义语音信号;将自定义语音信号保存为语音唤醒词。In an optional implementation, before receiving the voice wake-up word, the receiving module 1601 is further configured to: receive a custom voice signal input by the user in response to the wake-up word custom operation; and save the customized voice signal as a voice wake-up word.
以上描述了语音识别装置1600的内部功能和结构,如图17所示,实际中,该语音识别装置1600可实现为一种电子设备,包括:存储器1701、处理器1702以及通信组件1703。该电子设备可以是终端设备,也可以是服务器。The internal function and structure of the speech recognition apparatus 1600 are described above. As shown in FIG. 17, in practice, the speech recognition apparatus 1600 can be implemented as an electronic device including: a memory 1701, a processor 1702, and a communication component 1703. The electronic device can be a terminal device or a server.
存储器1701,用于存储计算机程序,并可被存储为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。The memory 1701 is configured to store a computer program, and may further store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
存储器1701可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 1701 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器1702,与存储器1701耦合,用于执行存储器1701中的计算机程序,以用于:通过通信组件1703接收语音唤醒词;识别语音唤醒词所属的第一方言;从不同方言对应的ASR模型中选择第一方言对应的ASR模型;利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor 1702 is coupled to the memory 1701 for executing a computer program in the memory 1701 for: receiving a speech wake-up word through the communication component 1703; identifying a first dialect to which the speech wake-up word belongs; from an ASR model corresponding to different dialects The ASR model corresponding to the first dialect is selected; the ASR model corresponding to the first dialect is used to perform speech recognition on the speech signal to be recognized.
通信组件1703,用于接收语音唤醒词。The communication component 1703 is configured to receive a voice wake-up word.
在一可选实施方式中,处理器1702在识别语音唤醒词所属的第一方言时,具体用于:将语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配,获取与语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为第一方言;或者将语音唤醒词的声学特征分别与不同方言的声学特征进行匹配,获取与语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为第一方言;或者将语音唤醒词转换成文本唤醒词,将文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配,获取与文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为第一方言。In an optional implementation, when identifying the first dialect to which the voice wake-up word belongs, the processor 1702 is specifically configured to: perform dynamic matching of acoustic features between the voice wake-up word and reference wake-up words recorded in different dialects, and take, as the first dialect, the dialect corresponding to the reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or match the acoustic features of the voice wake-up word against the acoustic features of different dialects, and take, as the first dialect, the dialect whose acoustic features match those of the voice wake-up word to a degree that meets a second set requirement; or convert the voice wake-up word into a text wake-up word, match the text wake-up word against the reference text wake-up words corresponding to different dialects, and take, as the first dialect, the dialect corresponding to the reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
在一可选实施方式中,如图17所示,该电子设备还包括:显示屏1704。基于此,处理器1702在接收终端设备发送的语音唤醒词时,具体用于:响应于激活或开启终端设备的指令,通过显示屏1704向用户展示语音输入界面;并基于语音输入界面获取用户输入的语音唤醒词。In an optional implementation, as shown in FIG. 17, the electronic device further includes a display screen 1704. On this basis, when receiving the voice wake-up word sent by the terminal device, the processor 1702 is specifically configured to: present a voice input interface to the user through the display screen 1704 in response to an instruction to activate or turn on the terminal device; and acquire the voice wake-up word input by the user via the voice input interface.
在一可选实施方式中,如图17所示,该电子设备还包括:音频组件1706。基于此,处理器1702在利用第一方言对应的ASR模型对待识别语音信号进行语音识别之前,还用于:通过音频组件1706输出语音输入提示信息,以提示用户进行语音输入;并接收用户输入的待识别语音信号。相应地,处理器1702还用于:通过音频组件1706接收语音唤醒词。In an optional implementation, as shown in FIG. 17, the electronic device further includes an audio component 1706. On this basis, before performing speech recognition on the to-be-recognized voice signal using the ASR model corresponding to the first dialect, the processor 1702 is further configured to: output voice input prompt information through the audio component 1706 to prompt the user to perform voice input; and receive the to-be-recognized voice signal input by the user. Correspondingly, the processor 1702 is further configured to receive the voice wake-up word through the audio component 1706.
在一可选实施方式中,处理器1702在从不同方言对应的ASR模型中,选择第一方言对应的ASR模型之前,还用于收集不同方言的语料;对不同方言的语料进行特征提取,以得到不同方言的声学特征;根据不同方言的声学特征,构建不同方言对应的ASR模型。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor 1702 is further configured to: collect corpora of different dialects; perform feature extraction on the corpora of the different dialects to obtain the acoustic features of the different dialects; and build the ASR models corresponding to the different dialects from those acoustic features.
在一可选实施方式中,处理器1702在接收语音唤醒词之前,还用于:响应于唤醒词自定义操作,通过通信组件1703接收用户输入的自定义语音信号;将自定义语音信号保存为语音唤醒词。进一步,如图17所示,该电子设备还包括:电源组件1705等其它组件。In an optional implementation, before receiving the voice wake-up word, the processor 1702 is further configured to: in response to a wake-up word customization operation, receive a customized voice signal input by the user via the communication component 1703; and save the customized voice signal as the voice wake-up word. Further, as shown in FIG. 17, the electronic device further includes other components such as a power component 1705.
相应地,本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,计算机程序被执行时能够实现上述方法实施例中可由电子设备执行的各步骤。Correspondingly, the embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the steps executable by the electronic device in the foregoing method embodiment can be implemented.
在本实施例中,针对不同方言构建ASR模型,在语音识别过程中,预先识别语音唤醒词所属的方言,进而从不同方言对应的ASR模型中选择与语音唤醒词所属的方言对应的ASR模型,利用所选择的ASR模型对后续待识别语音信号进行语音识别,实现多方言语音识别的自动化,并且基于语音唤醒词自动选择相应方言的ASR模型,无需用户手动操作,实现起来更加方便、快捷,有利于提高多方言语音识别的效率。In this embodiment, ASR models are built for different dialects. During speech recognition, the dialect to which the voice wake-up word belongs is identified first, the ASR model corresponding to that dialect is then selected from the ASR models corresponding to the different dialects, and the selected ASR model is used to perform speech recognition on the subsequent to-be-recognized voice signal. This automates multi-dialect speech recognition: the ASR model of the corresponding dialect is selected automatically based on the voice wake-up word, without manual user operation, which is more convenient and faster and helps improve the efficiency of multi-dialect speech recognition.
进一步地,基于语音唤醒词比较简短,识别语音唤醒词所属的方言的过程耗时较短,使得语音识别系统能够快速识别语音唤醒词所属的第一方言,并选择与第一方言对应的ASR模型,进一步提高对多方言语音进行识别的效率。Further, since the voice wake-up word is relatively short, the process of identifying the dialect to which it belongs takes little time, so the speech recognition system can quickly identify the first dialect to which the voice wake-up word belongs and select the ASR model corresponding to the first dialect, further improving the efficiency of multi-dialect speech recognition.
本申请实施例还提供一种终端设备,包括:存储器、处理器和通信组件。The embodiment of the present application further provides a terminal device, including: a memory, a processor, and a communication component.
存储器,用于存储计算机程序,并可被存储为存储其它各种数据以支持在终端设备上的操作。这些数据的示例包括用于在终端设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。A memory for storing a computer program and can be stored to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phone book data, messages, pictures, videos, and the like.
存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
处理器，与存储器和通信组件耦合，用于执行存储器中的计算机程序，以用于：通过通信组件接收语音唤醒词，以唤醒语音识别功能；通过通信组件接收用户输入的具有方言指示意义的第一语音信号；从第一语音信号中解析出需要进行语音识别的第一方言；从不同方言对应的ASR模型中选择第一方言对应的ASR模型；通过通信组件向服务器发送服务请求，以请求服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型；通过通信组件向服务器发送待识别语音信号，以供服务器利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor, coupled to the memory and the communication component, is configured to execute the computer program in the memory to: receive a voice wake-up word through the communication component to wake up the voice recognition function; receive, through the communication component, a first voice signal input by the user that indicates a dialect; parse, from the first voice signal, a first dialect in which voice recognition is to be performed; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; send a service request to the server through the communication component to request the server to select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and send the voice signal to be recognized to the server through the communication component, so that the server performs voice recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
所述通信组件,用于接收语音唤醒词和所述第一语音信号,以及向所述服务器发送服务请求和待识别语音信号。The communication component is configured to receive a voice wake-up word and the first voice signal, and send a service request and a voice signal to be recognized to the server.
在一可选实施方式中,处理器在向服务器发送服务请求之前,还用于:若未能从第一语音信号中解析出第一方言,识别语音唤醒词所属的方言作为第一方言。In an optional implementation, before sending the service request to the server, the processor is further configured to: if the first dialect is not parsed from the first voice signal, identify a dialect to which the voice wake-up word belongs as the first dialect.
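A minimal sketch of this optional fallback, with the function and variable names assumed for illustration (the disclosure itself only specifies the behavior: use the wake-up word's dialect when parsing the first voice signal fails):

```python
# Hypothetical sketch: prefer the dialect parsed from the user's first
# voice signal; when parsing fails (represented here as None), fall back
# to the dialect recognized from the wake-up word itself.
def resolve_first_dialect(parsed_dialect, wake_word_dialect):
    """Return the dialect to use for ASR model selection."""
    return parsed_dialect if parsed_dialect is not None else wake_word_dialect
```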
在一可选实施方式中，存储器还用于存储不同方言名称对应的音素片段。相应地，处理器在从第一语音信号中解析出需要进行语音识别的第一方言时，具体用于：基于声学模型将所述第一语音信号转换为第一音素序列；将存储器中存储的不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配；当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。In an optional implementation, the memory is further configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing, from the first voice signal, the first dialect in which voice recognition is to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names, stored in the memory, against the first phoneme sequence, respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
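The phoneme-segment matching step can be sketched as follows. The phoneme symbols, the dialect-name fragments, and the contiguous-subsequence matching rule are all illustrative assumptions; the disclosure only requires that stored dialect-name fragments be matched within the first phoneme sequence.

```python
# Hypothetical phoneme fragments for dialect names, e.g. as produced by an
# acoustic model; the symbols below are illustrative, not a real phone set.
DIALECT_NAME_PHONEMES = {
    "sichuanese": ["s", "i4", "ch", "uan", "h", "ua"],   # e.g. "四川话"
    "cantonese":  ["g", "uang", "d", "ong", "h", "ua"],  # e.g. "广东话"
}

def parse_dialect(first_phoneme_sequence):
    """Return the dialect whose name fragment occurs contiguously in the
    utterance's phoneme sequence, or None when no fragment matches."""
    for dialect, fragment in DIALECT_NAME_PHONEMES.items():
        m = len(fragment)
        for i in range(len(first_phoneme_sequence) - m + 1):
            if first_phoneme_sequence[i:i + m] == fragment:
                return dialect
    return None
```

When `parse_dialect` returns `None`, the optional fallback described above (using the wake-up word's own dialect) would apply.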
本申请实施例还提供一种服务器,包括:存储器、处理器和通信组件。The embodiment of the present application further provides a server, including: a memory, a processor, and a communication component.
存储器，用于存储计算机程序，并可被配置为存储其它各种数据以支持在服务器上的操作。这些数据的示例包括用于在服务器上操作的任何应用程序或方法的指令，联系人数据，电话簿数据，消息，图片，视频等。The memory is configured to store a computer program, and may further be configured to store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server, contact data, phone book data, messages, pictures, videos, and the like.
存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。The memory can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
处理器，与存储器和通信组件耦合，用于执行存储器中的计算机程序，以用于：通过通信组件接收终端设备发送的语音唤醒词，以唤醒语音识别功能；通过通信组件接收终端设备发送的具有方言指示意义的第一语音信号；从第一语音信号中解析出需要进行语音识别的第一方言；从不同方言对应的ASR模型中选择第一方言对应的ASR模型；通过通信组件接收终端设备发送的待识别语音信号，并利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor, coupled to the memory and the communication component, is configured to execute the computer program in the memory to: receive, through the communication component, a voice wake-up word sent by the terminal device to wake up the voice recognition function; receive, through the communication component, a first voice signal sent by the terminal device that indicates a dialect; parse, from the first voice signal, a first dialect in which voice recognition is to be performed; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and receive, through the communication component, the voice signal to be recognized sent by the terminal device, and perform voice recognition on it using the ASR model corresponding to the first dialect.
通信组件，用于接收语音唤醒词、第一语音信号和待识别语音信号。The communication component is configured to receive the voice wake-up word, the first voice signal, and the voice signal to be recognized.
在一可选实施方式中，处理器在从不同方言对应的ASR模型中选择第一方言对应的ASR模型之前，还用于：若未能从第一语音信号中解析出第一方言，识别语音唤醒词所属的方言作为第一方言。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect fails to be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
在一可选实施方式中，存储器还用于存储不同方言名称对应的音素片段。相应地，处理器在从第一语音信号中解析出需要进行语音识别的第一方言时，具体用于：基于声学模型将所述第一语音信号转换为第一音素序列；将存储器中存储的不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配；当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。In an optional implementation, the memory is further configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing, from the first voice signal, the first dialect in which voice recognition is to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names, stored in the memory, against the first phoneme sequence, respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
本申请实施例还提供一种电子设备,该电子设备可以是终端设备,也可以是服务器。该电子设备包括:存储器、处理器和通信组件。The embodiment of the present application further provides an electronic device, which may be a terminal device or a server. The electronic device includes a memory, a processor, and a communication component.
存储器，用于存储计算机程序，并可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令，联系人数据，电话簿数据，消息，图片，视频等。The memory is configured to store a computer program, and may further be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
存储器可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。The memory can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
处理器，与存储器和通信组件耦合，用于执行存储器中的计算机程序，以用于：通过通信组件接收语音唤醒词，以唤醒语音识别功能；通过通信组件接收用户输入的具有方言指示意义的第一语音信号；从第一语音信号中解析出需要进行语音识别的第一方言；从不同方言对应的ASR模型中选择第一方言对应的ASR模型；利用第一方言对应的ASR模型对待识别语音信号进行语音识别。The processor, coupled to the memory and the communication component, is configured to execute the computer program in the memory to: receive a voice wake-up word through the communication component to wake up the voice recognition function; receive, through the communication component, a first voice signal input by the user that indicates a dialect; parse, from the first voice signal, a first dialect in which voice recognition is to be performed; select the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects; and perform voice recognition on the voice signal to be recognized using the ASR model corresponding to the first dialect.
通信组件，用于接收语音唤醒词和第一语音信号。The communication component is configured to receive the voice wake-up word and the first voice signal.
在一可选实施方式中，处理器在从不同方言对应的ASR模型中选择第一方言对应的ASR模型之前，还用于：若未能从第一语音信号中解析出第一方言，识别语音唤醒词所属的方言作为第一方言。In an optional implementation, before selecting the ASR model corresponding to the first dialect from the ASR models corresponding to different dialects, the processor is further configured to: if the first dialect fails to be parsed from the first voice signal, identify the dialect to which the voice wake-up word belongs as the first dialect.
在一可选实施方式中，存储器还用于存储不同方言名称对应的音素片段。相应地，处理器在从第一语音信号中解析出需要进行语音识别的第一方言时，具体用于：基于声学模型将所述第一语音信号转换为第一音素序列；将存储器中存储的不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配；当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。In an optional implementation, the memory is further configured to store phoneme segments corresponding to different dialect names. Correspondingly, when parsing, from the first voice signal, the first dialect in which voice recognition is to be performed, the processor is specifically configured to: convert the first voice signal into a first phoneme sequence based on an acoustic model; match the phoneme segments corresponding to the different dialect names, stored in the memory, against the first phoneme sequence, respectively; and when a phoneme segment is matched in the first phoneme sequence, take the dialect corresponding to the matched phoneme segment as the first dialect.
上述图9、图11、图13、图15和图17中的通信组件被配置为便于通信组件所在设备和其他设备之间有线或无线方式的通信。通信组件所在设备可以接入基于通信标准的无线网络，如WiFi，2G或3G，或它们的组合。在一个示例性实施例中，通信组件经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中，通信组件还包括近场通信(NFC)模块，以促进短程通信。例如，NFC模块可基于射频识别(RFID)技术，红外数据协会(IrDA)技术，超宽带(UWB)技术，蓝牙(BT)技术和其他技术来实现。The communication components in Figures 9, 11, 13, 15, and 17 above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
上述图9、图11、图13、图15和图17中的显示屏包括液晶显示器(LCD)和触摸面板(TP)。如果显示屏包括触摸面板,显示屏可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与触摸或滑动操作相关的持续时间和压力。The display screens of FIGS. 9, 11, 13, 15, and 17 described above include a liquid crystal display (LCD) and a touch panel (TP). If the display includes a touch panel, the display can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
上述图9、图11、图13、图15和图17中的电源组件为电源组件所在设备的各种组件提供电力。电源组件可以包括电源管理系统,一个或多个电源,及其他与为电源组件所在设备生成、管理和分配电力相关联的组件。The power components in Figures 9, 11, 13, 15, and 17 above provide power to the various components of the device in which the power components are located. The power components can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
上述图9、图11、图13、图15和图17中的音频组件可被配置为输出和/或输入音频信号。例如，音频组件包括一个麦克风(MIC)，当音频组件所在设备处于操作模式，如呼叫模式、记录模式和语音识别模式时，麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器或经由通信组件发送。在一些实施例中，音频组件还包括一个扬声器，用于输出音频信号。The audio components in Figures 9, 11, 13, 15, and 17 above are configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC), which is configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal can be further stored in the memory or transmitted via the communication component. In some embodiments, the audio component further includes a speaker for outputting audio signals.
本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质（包括但不限于磁盘存储器、CD-ROM、光学存储器等）上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
本发明是参照根据本发明实施例的方法、设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
在一个典型的配置中，计算设备包括一个或多个处理器（CPU）、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存（PRAM）、静态随机存取存储器（SRAM）、动态随机存取存储器（DRAM）、其他类型的随机存取存储器（RAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、快闪记忆体或其他内存技术、只读光盘只读存储器（CD-ROM）、数字多功能光盘（DVD）或其他光学存储、磁盒式磁带，磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体（transitory media），如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
还需要说明的是，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
以上所述仅为本申请的实施例而已，并不用于限制本申请。对于本领域技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等，均应包含在本申请的权利要求范围之内。The above descriptions are merely embodiments of the present application and are not intended to limit the present application. Those skilled in the art can make various changes and modifications to the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (24)

  1. 一种语音识别方法，适用于终端设备，其特征在于，所述方法包括：A voice recognition method, applicable to a terminal device, wherein the method comprises:
    接收语音唤醒词;Receiving a speech wake-up word;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    向服务器发送服务请求,以请求所述服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Sending a service request to the server, requesting the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。Sending a to-be-identified voice signal to the server, so that the server performs voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  2. 根据权利要求1所述的方法,其特征在于,所述识别所述语音唤醒词所属的第一方言,包括:The method according to claim 1, wherein the identifying the first dialect to which the voice wake-up word belongs comprises:
    将所述语音唤醒词分别与以不同方言录制的基准唤醒词进行声学特征的动态匹配，获取与所述语音唤醒词的匹配度符合第一设定要求的基准唤醒词对应的方言作为所述第一方言；或者dynamically matching acoustic features of the voice wake-up word against reference wake-up words recorded in different dialects, and taking, as the first dialect, the dialect corresponding to a reference wake-up word whose degree of matching with the voice wake-up word meets a first set requirement; or
    将所述语音唤醒词的声学特征分别与不同方言的声学特征进行匹配，获取与所述语音唤醒词的声学特征的匹配度符合第二设定要求的方言作为所述第一方言；或者matching the acoustic features of the voice wake-up word against the acoustic features of different dialects, and taking, as the first dialect, a dialect whose degree of matching with the acoustic features of the voice wake-up word meets a second set requirement; or
    将所述语音唤醒词转换成文本唤醒词，将所述文本唤醒词分别与不同方言对应的基准文本唤醒词进行匹配，获取与所述文本唤醒词的匹配度符合第三设定要求的基准文本唤醒词对应的方言作为所述第一方言。converting the voice wake-up word into a text wake-up word, matching the text wake-up word against reference text wake-up words corresponding to different dialects, and taking, as the first dialect, the dialect corresponding to a reference text wake-up word whose degree of matching with the text wake-up word meets a third set requirement.
  3. 根据权利要求1所述的方法,其特征在于,所述接收语音唤醒词,包括:The method according to claim 1, wherein said receiving a speech wake-up word comprises:
    响应于激活或开启所述终端设备的指令,向用户展示语音输入界面;Presenting a voice input interface to the user in response to an instruction to activate or turn on the terminal device;
    基于所述语音输入界面获取所述用户输入的语音唤醒词。Acquiring a voice wake-up word input by the user based on the voice input interface.
  4. 根据权利要求1-3任一项所述的方法,其特征在于,在向所述服务器发送待识别语音信号之前,所述方法还包括:The method according to any one of claims 1-3, wherein before the sending the to-be-identified voice signal to the server, the method further comprises:
    输出语音输入提示信息,以提示用户进行语音输入;Outputting a voice input prompt message to prompt the user to perform voice input;
    接收所述用户输入的待识别语音信号。Receiving a voice signal to be recognized input by the user.
  5. 根据权利要求4所述的方法,其特征在于,在输出语音输入提示信息之前,所述方法还包括:The method according to claim 4, wherein before the outputting the voice input prompt information, the method further comprises:
    接收所述服务器返回的通知消息,所述通知消息用于指示已选择所述第一方言对应的ASR模型。Receiving a notification message returned by the server, the notification message is used to indicate that the ASR model corresponding to the first dialect has been selected.
  6. 根据权利要求1-3任一项所述的方法，其特征在于，在接收语音唤醒词之前，所述方法还包括：The method according to any one of claims 1-3, wherein before receiving the voice wake-up word, the method further comprises:
    响应于唤醒词自定义操作,接收用户输入的自定义语音信号;Receiving a custom voice signal input by the user in response to the wake-up word custom operation;
    将所述自定义语音信号保存为所述语音唤醒词。The custom speech signal is saved as the speech wake-up word.
  7. 一种语音识别方法，适用于服务器，其特征在于，所述方法包括：A voice recognition method, applicable to a server, wherein the method comprises:
    接收终端设备发送的服务请求,所述服务请求指示选择第一方言对应的ASR模型;Receiving a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is selected;
    从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型,所述第一方言是所述语音唤醒词所属的方言;Selecting, in the ASR model corresponding to different dialects, an ASR model corresponding to the first dialect, where the first dialect is a dialect to which the voice wake-up word belongs;
    接收所述终端设备发送的待识别语音信号,并利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。And receiving the to-be-identified voice signal sent by the terminal device, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  8. 根据权利要求7所述的方法,其特征在于,在从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型之前,所述方法还包括:The method according to claim 7, wherein before the ASR model corresponding to the first dialect is selected from the ASR models corresponding to different dialects, the method further includes:
    收集不同方言的语料;Collect corpora of different dialects;
    对所述不同方言的语料进行特征提取，以得到不同方言的声学特征；performing feature extraction on the corpora of the different dialects to obtain acoustic features of the different dialects;
    根据所述不同方言的声学特征,构建不同方言对应的ASR模型。According to the acoustic characteristics of the different dialects, the ASR models corresponding to different dialects are constructed.
  9. 一种语音识别方法，适用于终端设备，其特征在于，所述方法包括：A voice recognition method, applicable to a terminal device, wherein the method comprises:
    接收语音唤醒词;Receiving a speech wake-up word;
    向服务器发送所述语音唤醒词,以供服务器基于所述语音唤醒词从不同方言对应的ASR模型中选择所述语音唤醒词所属第一方言对应的ASR模型;Sending the voice wake-up word to the server, for the server to select an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects based on the voice wake-up words;
    向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。Sending a to-be-identified voice signal to the server, so that the server performs voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  10. 一种语音识别方法，适用于服务器，其特征在于，所述方法包括：A voice recognition method, applicable to a server, wherein the method comprises:
    接收终端设备发送的语音唤醒词;Receiving a voice wake-up word sent by the terminal device;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型;Selecting an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    接收所述终端设备发送的待识别语音信号,并利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。And receiving the to-be-identified voice signal sent by the terminal device, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  11. 一种语音识别方法,其特征在于,包括:A speech recognition method, comprising:
    接收语音唤醒词;Receiving a speech wake-up word;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Selecting an ASR model corresponding to the first dialect from an ASR model corresponding to different dialects;
    利用所述第一方言对应的ASR模型对待识别语音信号进行语音识别。The ASR model corresponding to the first dialect is used for speech recognition of the speech signal to be recognized.
  12. 一种语音识别方法，适用于终端设备，其特征在于，所述方法包括：A voice recognition method, applicable to a terminal device, wherein the method comprises:
    接收语音唤醒词,以唤醒语音识别功能;Receiving a voice wake-up word to wake up the voice recognition function;
    接收用户输入的具有方言指示意义的第一语音信号；receiving a first voice signal input by the user that indicates a dialect;
    从所述第一语音信号中解析出需要进行语音识别的第一方言；parsing, from the first voice signal, a first dialect in which voice recognition is to be performed;
    向服务器发送服务请求,以请求所述服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Sending a service request to the server, requesting the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别。Sending a to-be-identified voice signal to the server, so that the server performs voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect.
  13. 根据权利要求12所述的方法,其特征在于,在向服务器发送服务请求之前,所述方法还包括:The method of claim 12, wherein before the sending of the service request to the server, the method further comprises:
    若未能从所述第一语音信号中解析出所述第一方言,识别所述语音唤醒词所属的方言作为所述第一方言。If the first dialect is not parsed from the first voice signal, the dialect to which the voice wake-up word belongs is identified as the first dialect.
  14. 根据权利要求12或13所述的方法,其特征在于,所述从所述第一语音信号中解析出需要进行语音识别的第一方言,包括:The method according to claim 12 or 13, wherein the parsing the first dialect that needs speech recognition from the first speech signal comprises:
    基于声学模型将所述第一语音信号转换为第一音素序列;Converting the first speech signal into a first phoneme sequence based on an acoustic model;
    将不同方言名称对应的音素片段分别在所述第一音素序列中进行匹配;Matching phoneme segments corresponding to different dialect names in the first phoneme sequence;
    当在所述第一音素序列中匹配中音素片段时，将所述匹配中的音素片段对应的方言作为所述第一方言。when a phoneme segment is matched in the first phoneme sequence, taking the dialect corresponding to the matched phoneme segment as the first dialect.
  15. 一种终端设备,其特征在于,包括:存储器、处理器以及通信组件;A terminal device, comprising: a memory, a processor, and a communication component;
    所述存储器,用于存储计算机程序;The memory for storing a computer program;
    所述处理器,与所述存储器耦合,用于执行所述计算机程序,以用于:The processor, coupled to the memory, for executing the computer program for:
    通过所述通信组件接收语音唤醒词;Receiving a voice wake-up word through the communication component;
    识别所述语音唤醒词所属的第一方言;Identifying a first dialect to which the voice wake-up word belongs;
    通过所述通信组件向服务器发送服务请求,以请求所述服务器从不同方言对应的ASR模型中选择所述第一方言对应的ASR模型;Sending, by the communication component, a service request to the server, to request the server to select an ASR model corresponding to the first dialect from the ASR models corresponding to different dialects;
    通过所述通信组件向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别;Sending, by the communication component, the to-be-identified voice signal to the server, for the server to perform voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect;
    所述通信组件,用于接收所述语音唤醒词,向所述服务器发送所述服务请求以及所述待识别语音信号。The communication component is configured to receive the voice wake-up word, and send the service request and the to-be-identified voice signal to the server.
  16. 一种服务器,其特征在于,包括:存储器、处理器以及通信组件;A server, comprising: a memory, a processor, and a communication component;
    所述存储器,用于存储计算机程序;The memory for storing a computer program;
    所述处理器,与所述存储器耦合,用于执行所述计算机程序,以用于:The processor, coupled to the memory, for executing the computer program for:
    通过所述通信组件接收终端设备发送的服务请求,所述服务请求指示选择第一方言对应的ASR模型;Receiving, by the communication component, a service request sent by the terminal device, where the service request indicates that the ASR model corresponding to the first dialect is selected;
    从不同方言对应的ASR模型中,选择所述第一方言对应的ASR模型,所述第一方言是所述终端设备接收的语音唤醒词所属的方言;Selecting, from the ASR model corresponding to the different dialects, the ASR model corresponding to the first dialect, where the first dialect is a dialect to which the voice wake-up word received by the terminal device belongs;
    通过所述通信组件接收所述终端设备发送的待识别语音信号,并利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别;Receiving, by the communication component, the to-be-identified voice signal sent by the terminal device, and performing voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect;
    所述通信组件,用于接收所述服务请求和所述待识别语音信号。The communication component is configured to receive the service request and the to-be-identified voice signal.
  17. 一种终端设备,其特征在于,包括:存储器、处理器以及通信组件;A terminal device, comprising: a memory, a processor, and a communication component;
    所述存储器,用于存储计算机程序;The memory for storing a computer program;
    所述处理器,与所述存储器耦合,用于执行所述计算机程序,以用于:The processor, coupled to the memory, for executing the computer program for:
    通过所述通信组件接收语音唤醒词;Receiving a voice wake-up word through the communication component;
    通过所述通信组件向服务器发送所述语音唤醒词,以供服务器基于所述语音唤醒词从不同方言对应的ASR模型中选择所述语音唤醒词所属第一方言对应的ASR模型;Sending, by the communication component, the voice wake-up words to the server, so that the server selects an ASR model corresponding to the first dialect to which the voice wake-up word belongs from the ASR models corresponding to different dialects based on the voice wake-up words;
    通过所述通信组件向所述服务器发送待识别语音信号,以供所述服务器利用所述第一方言对应的ASR模型对所述待识别语音信号进行语音识别;Sending, by the communication component, the to-be-identified voice signal to the server, for the server to perform voice recognition on the to-be-identified voice signal by using an ASR model corresponding to the first dialect;
    所述通信组件,用于接收所述语音唤醒词,向所述服务器发送所述语音唤醒词和所述待识别语音信号。The communication component is configured to receive the voice wake-up word, and send the voice wake-up word and the to-be-identified voice signal to the server.
  18. A server, comprising: a memory, a processor, and a communication component;
    the memory being configured to store a computer program;
    the processor, coupled to the memory, being configured to execute the computer program so as to:
    receive, through the communication component, a voice wake-up word sent by a terminal device;
    identify the first dialect to which the voice wake-up word belongs;
    select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
    receive, through the communication component, a speech signal to be recognized sent by the terminal device, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
    the communication component being configured to receive the voice wake-up word and the speech signal to be recognized.
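The server-side flow of claim 18 — identify the dialect of the wake-up word, select the matching ASR model from a per-dialect pool, then recognize subsequent speech with that model — can be sketched as follows. This is a minimal illustration only: the `DialectRoutingServer` class, the classifier callable, and the `recognize` model interface are hypothetical stand-ins, not defined by the patent or by any particular library.

```python
# Illustrative sketch of the claim-18 server flow (all names hypothetical).

class DialectRoutingServer:
    """Routes speech to the ASR model of the dialect detected in the
    voice wake-up word, as described in claim 18."""

    def __init__(self, dialect_classifier, asr_models):
        # dialect_classifier: wake-word audio -> dialect name.
        # asr_models: mapping from dialect name to a model object that
        # exposes a `recognize(audio) -> str` method (assumed interface).
        self.dialect_classifier = dialect_classifier
        self.asr_models = asr_models
        self.selected_model = None

    def on_wake_word(self, wake_word_audio):
        # Identify the first dialect to which the wake-up word belongs,
        # then select the matching ASR model from the per-dialect pool.
        dialect = self.dialect_classifier(wake_word_audio)
        if dialect not in self.asr_models:
            raise KeyError(f"no ASR model registered for dialect {dialect!r}")
        self.selected_model = self.asr_models[dialect]
        return dialect

    def on_speech(self, audio):
        # Recognize subsequent speech with the previously selected model.
        if self.selected_model is None:
            raise RuntimeError("wake-up word must be processed first")
        return self.selected_model.recognize(audio)
```

In this sketch the model selected at wake-up time is reused for every later speech signal from the same session, which mirrors the claim's two-step receive order (wake-up word first, speech signal second).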
  19. An electronic device, comprising: a memory, a processor, and a communication component;
    the memory being configured to store a computer program;
    the processor, coupled to the memory, being configured to execute the computer program so as to:
    receive a voice wake-up word through the communication component;
    identify the first dialect to which the voice wake-up word belongs;
    select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
    perform speech recognition on a speech signal to be recognized using the ASR model corresponding to the first dialect;
    the communication component being configured to receive the voice wake-up word.
  20. A terminal device, comprising: a memory, a processor, and a communication component;
    the memory being configured to store a computer program;
    the processor, coupled to the memory, being configured to execute the computer program so as to:
    receive a voice wake-up word through the communication component, so as to wake up the speech recognition function;
    receive, through the communication component, a dialect-indicating first voice signal input by a user;
    parse, from the first voice signal, the first dialect in which speech recognition is to be performed;
    send a service request to a server through the communication component, requesting the server to select, from ASR models corresponding to different dialects, the ASR model corresponding to the first dialect;
    send a speech signal to be recognized to the server through the communication component, so that the server performs speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect;
    the communication component being configured to receive the voice wake-up word and the first voice signal, and to send the service request and the speech signal to be recognized to the server.
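The terminal-side steps of claim 20 — extract the desired dialect from a dialect-indicating utterance and build the service request that asks the server to select the matching ASR model — can be sketched as below. The transcription step is assumed to have already happened; the dialect list, the keyword-matching heuristic, and the request format are all assumptions for illustration, not part of the patent.

```python
# Hypothetical sketch of the claim-20 terminal-side flow.
from typing import Optional

# Assumed inventory of dialects the server has ASR models for.
KNOWN_DIALECTS = {"cantonese", "sichuanese", "shanghainese"}

def parse_dialect(transcript: str) -> Optional[str]:
    """Extract the requested dialect from a dialect-indicating utterance,
    e.g. 'please switch to cantonese'. Returns None if no known dialect
    is mentioned."""
    for word in transcript.lower().split():
        if word in KNOWN_DIALECTS:
            return word
    return None

def build_service_request(transcript: str) -> dict:
    """Build the service request that asks the server to select the ASR
    model corresponding to the first dialect (request shape assumed)."""
    dialect = parse_dialect(transcript)
    if dialect is None:
        raise ValueError("no known dialect found in utterance")
    return {"action": "select_asr_model", "dialect": dialect}
```

A real terminal would derive the transcript from the first voice signal with an on-device recognizer; simple keyword matching stands in for that parsing step here.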
  21. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer, implements the steps of the method according to any one of claims 1-6.
  22. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer, implements the steps of the method according to any one of claims 7-8.
  23. A speech recognition system, comprising a server and a terminal device;
    the terminal device being configured to receive a voice wake-up word, identify the first dialect to which the voice wake-up word belongs, send a service request to the server, and send a speech signal to be recognized to the server, the service request indicating that the ASR model corresponding to the first dialect is to be selected;
    the server being configured to receive the service request, select, according to the indication of the service request, the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
  24. A speech recognition system, comprising a server and a terminal device;
    the terminal device being configured to receive a voice wake-up word, send the voice wake-up word to the server, and send a speech signal to be recognized to the server;
    the server being configured to receive the voice wake-up word, identify the first dialect to which the voice wake-up word belongs, select the ASR model corresponding to the first dialect from ASR models corresponding to different dialects, receive the speech signal to be recognized, and perform speech recognition on the speech signal to be recognized using the ASR model corresponding to the first dialect.
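The end-to-end exchange of the claim-24 system — terminal forwards the wake-up word, server identifies the dialect and selects a model, terminal then forwards speech for recognition — can be simulated in-process as follows. The message format, the classifier, and the per-dialect models are all hypothetical stand-ins; a real system would carry these messages over a network transport.

```python
# Minimal in-process simulation of the claim-24 terminal/server exchange
# (all component names and the message format are assumptions).

class Server:
    def __init__(self, classify, models):
        self.classify = classify  # wake-word audio -> dialect name
        self.models = models      # dialect name -> recognize function
        self.model = None

    def handle(self, message):
        kind, payload = message
        if kind == "wake_word":
            # Identify the dialect of the wake-up word and select its model.
            self.model = self.models[self.classify(payload)]
            return "model_selected"
        if kind == "speech":
            # Recognize the speech signal with the dialect-specific model.
            return self.model(payload)
        raise ValueError(f"unknown message kind {kind!r}")

class Terminal:
    def __init__(self, server):
        self.server = server

    def speak(self, wake_word_audio, speech_audio):
        # Send the wake-up word first, then the speech signal to be
        # recognized, matching the order the claim describes.
        self.server.handle(("wake_word", wake_word_audio))
        return self.server.handle(("speech", speech_audio))
```

This contrasts with the claim-23 variant, where the terminal (not the server) identifies the dialect and sends only a service request naming it.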
PCT/CN2018/114531 2017-11-17 2018-11-08 Speech recognition method, device and system WO2019096056A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711147698.X 2017-11-17
CN201711147698.XA CN109817220A (en) 2017-11-17 2017-11-17 Audio recognition method, apparatus and system

Publications (1)

Publication Number Publication Date
WO2019096056A1 true WO2019096056A1 (en) 2019-05-23

Family

ID=66539363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/114531 WO2019096056A1 (en) 2017-11-17 2018-11-08 Speech recognition method, device and system

Country Status (3)

Country Link
CN (1) CN109817220A (en)
TW (1) TW201923736A (en)
WO (1) WO2019096056A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102819A (en) * 2019-05-29 2020-12-18 Nanning Fugui Precision Industrial Co., Ltd. Voice recognition device and method for switching recognition languages thereof
CN112116909A (en) * 2019-06-20 2020-12-22 Hangzhou Hikvision Digital Technology Co., Ltd. Voice recognition method, device and system
CN110364147B (en) * 2019-08-29 2021-08-20 Xiamen Sixinwei Technology Co., Ltd. Awakening training word acquisition system and method
CN111091809B (en) * 2019-10-31 2023-05-23 National Computer Network and Information Security Administration Center Regional accent recognition method and device based on depth feature fusion
CN110853643A (en) * 2019-11-18 2020-02-28 Beijing Xiaomi Mobile Software Co., Ltd. Method, device, equipment and storage medium for voice recognition in fast application
CN110827799B (en) * 2019-11-21 2022-06-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and medium for processing voice signal
CN111081217B (en) * 2019-12-03 2021-06-04 Gree Electric Appliances, Inc. of Zhuhai Voice wake-up method and device, electronic equipment and storage medium
CN111128125A (en) * 2019-12-30 2020-05-08 Shenzhen UBTECH Robotics Co., Ltd. Voice service configuration system and voice service configuration method and device thereof
CN111724766B (en) * 2020-06-29 2024-01-05 Hefei Xunfei Digital Technology Co., Ltd. Language identification method, related equipment and readable storage medium
CN112820296B (en) * 2021-01-06 2022-05-20 Beijing SoundAI Technology Co., Ltd. Data transmission method and electronic equipment
CN113506565A (en) * 2021-07-12 2021-10-15 Beijing Sinovoice Technology Co., Ltd. Speech recognition method, speech recognition device, computer-readable storage medium and processor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2814300A1 (en) * 2012-04-30 2013-10-30 Qnx Software Systems Limited Post processing of natural language ASR
CN105223851A (en) * 2015-10-09 2016-01-06 Hanshan Normal University Intelligent socket system and control method based on accent recognition
CN105957527A (en) * 2016-05-16 2016-09-21 Gree Electric Appliances, Inc. of Zhuhai Electric appliance speech control method and device and speech control air-conditioner
CN106452997A (en) * 2016-09-30 2017-02-22 Wuxi Little Swan Co., Ltd. Household electrical appliance and control system thereof
CN106997762A (en) * 2017-03-08 2017-08-01 GD Midea Air-Conditioning Equipment Co., Ltd. Voice control method and device for household electrical appliance
CN107134279A (en) * 2017-06-30 2017-09-05 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wake-up method, device, terminal and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
CN104036774B (en) * 2014-06-20 2018-03-06 National Computer Network and Information Security Administration Center Tibetan dialect recognition method and system
CN104575504A (en) * 2014-12-24 2015-04-29 Shanghai Normal University Method for personalized television voice wake-up by voiceprint and voice identification
CN105654943A (en) * 2015-10-26 2016-06-08 Leshi Zhixin Electronic Technology (Tianjin) Co., Ltd. Voice wakeup method, apparatus and system thereof
CN106653031A (en) * 2016-10-17 2017-05-10 Hisense Group Co., Ltd. Voice wake-up method and voice interaction device


Also Published As

Publication number Publication date
TW201923736A (en) 2019-06-16
CN109817220A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
WO2019096056A1 (en) Speech recognition method, device and system
US11132172B1 (en) Low latency audio data pipeline
US20230333688A1 (en) Systems and Methods for Identifying a Set of Characters in a Media File
US11024307B2 (en) Method and apparatus to provide comprehensive smart assistant services
US11915699B2 (en) Account association with device
CN110140168B (en) Contextual hotwords
US10977299B2 (en) Systems and methods for consolidating recorded content
KR102196400B1 (en) Determining hotword suitability
CN110825340B (en) Providing a pre-computed hotword model
KR101689290B1 (en) Device for extracting information from a dialog
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
CN110494841B (en) Contextual language translation
US11893350B2 (en) Detecting continuing conversations with computing devices
US10699706B1 (en) Systems and methods for device communications
KR102628211B1 (en) Electronic apparatus and thereof control method
WO2019045816A1 (en) Graphical data selection and presentation of digital content
WO2020052135A1 (en) Music recommendation method and apparatus, computing apparatus, and storage medium
CN112262432A (en) Voice processing device, voice processing method, and recording medium
KR20200082137A (en) Electronic apparatus and controlling method thereof
US20180350360A1 (en) Provide non-obtrusive output
JP6347939B2 (en) Utterance key word extraction device, key word extraction system using the device, method and program thereof
KR102584324B1 (en) Method for providing of voice recognition service and apparatus thereof
US11481188B1 (en) Application launch delay and notification
KR20210098250A (en) Electronic device and Method for controlling the electronic device thereof
US9251782B2 (en) System and method for concatenate speech samples within an optimal crossing point

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18879656; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18879656; Country of ref document: EP; Kind code of ref document: A1)