WO2017054122A1

WO2017054122A1 - Speech recognition system and method, client device and cloud server

Info

Publication number: WO2017054122A1
Application number: PCT/CN2015/091042
Authority: WO
Inventors: 李强生
Original assignee: 深圳市全圣时代科技有限公司
Priority date: 2015-09-29
Filing date: 2015-09-29
Publication date: 2017-04-06
Also published as: CN106537493A

Abstract

Disclosed is a speech recognition system, at least comprising: a speech input module configured to input speech of a user in real time upon activation of a real-time call or speech entry function; a feature extraction module configured to extract a speech feature from the inputted speech of the user; a model training module configured to establish, according to the speech feature and a preset rule, a corresponding acoustic and language model; and an updating module configured to save and update the acoustic and language model in a model database. Also provided are a speech recognition method, a client device and a cloud server.

Description

Speech recognition system and method, client device and cloud server

Technical field

The present invention relates to the field of voice recognition, and in particular, to a voice recognition system and method, and a client device and a cloud server with voice recognition function.

Background technique

"Large Vocabulary Continuous Speech Recognition" (LVCSR, referred to as "speech recognition") is a computer that recognizes which text corresponds to a certain piece of speech based on the language information contained in the continuous sound signal of the person. process.

Great vocabulary continuous Chinese speech recognizer has made great progress. For standard Mandarin, the accuracy of the recognizer can reach more than 95%. However, the dialect problem in Chinese is the main problem facing Chinese speech recognition. Since most people in China have a certain dialect background, in most cases, the performance of most speech recognizers will be greatly reduced or even impossible to use.

Currently, devices such as Apple's Siri and China's Keda Xunfei can provide voice input functions, but voice recognition is affected by the user's personal pronunciation, which leads to a great impact on the accuracy of speech recognition, which in turn affects the speech recognition function. Be applicable. In addition, a large number of non-intelligent client devices, when used, have their own voice control functions, and also affect the recognition of voice functions due to the problem of recognition rate during voice input, such as voice operation functions in cars, Bluetooth headsets. , voice control of devices such as doorbells, etc.

At present, the influence of many recognizer backgrounds on the performance of speech recognizers is eliminated or weakened by the database method. That is to say, when there is already a speech recognizer that recognizes standard Mandarin, it needs to have a certain dialect background. In the case of Mandarin recognition, the method is to collect a large number of first speech databases related to the dialect, and then use the existing acoustic model training method to retrain the acoustic model, or use the existing speaker adaptive method to acoustic models. Make adaptive. The disadvantages of this method are: (1) The workload of collecting the database with dialect background is very huge. For so many dialects in Chinese, the collection of the database is a huge project. (2) This method cannot balance standard Mandarin with pronunciation The commonality between the background Mandarins is only solved by the data-driven method, which is equivalent to completely reconstructing a speech recognizer, which brings difficulties in resource sharing and compatibility between speech recognizers of different dialect backgrounds.

Summary of the invention

In order to solve the above technical problem, the present invention provides a voice recognition system and method, and a client device and a cloud server with voice recognition function.

An embodiment of the present invention provides a voice recognition system, including at least: a voice input module, configured to input a user's voice in real time when a real-time call or voice input function is enabled; and a feature extraction module for inputting the user voice Extracting a voice feature; a model training module, configured to establish a corresponding acoustic and language model according to the voice feature and a preset rule; and an update module, configured to save and update the acoustic and language model into a model database.

Another embodiment of the present invention further provides a voice recognition method, including: inputting a user's voice in real time based on enabling a real-time call or voice input function; extracting a voice feature from the input user voice; according to the voice feature and a preset Rules, establishing corresponding acoustic and language models; and saving and updating the acoustic and language models into a model database in real time.

Yet another embodiment of the present invention provides a client device including the above-described voice recognition system.

Yet another embodiment of the present invention provides a cloud server including a plurality of private cloud master modules corresponding to different users. Each cloud master module includes: a feature extraction module, configured to extract a voice feature from a user voice input from a client device that is enabling real-time call or voice entry function; a model training module, configured to Pre-defined rules to establish corresponding acoustic and language models; and update modules for saving and updating the acoustic and language models into a model database.

The speech recognition system and method of the present invention records or saves real-time call and recorded information in real time and serves as a sample for speech model training, thereby continuously updating the model database according to different pronunciation characteristics of the user. Thereby, the user's individual needs can be satisfied, and a variety of voices, such as English or local dialects, can be supported, and the recognition degree is improved.

DRAWINGS

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art description will be briefly described below. Obviously, the following description The drawings in the drawings are only some of the embodiments of the present invention, and those skilled in the art can obtain other drawings according to the drawings without any inventive labor.

1 is a system frame diagram of a voice recognition system according to a first embodiment of the present invention;

Figure 2 is a functional block diagram of the speech recognition system of Figure 1;

3 is a functional block diagram of a voice recognition system according to a second embodiment of the present invention;

4 is a flowchart of a voice recognition method according to an embodiment of the present invention;

FIG. 5 is a flowchart of a voice recognition method according to another embodiment of the present invention; FIG.

FIG. 6 is a specific flowchart of step S409 in FIG. 5;

FIG. 7 is a flowchart of a voice recognition method according to still another embodiment of the present invention.

detailed description

The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings and specific embodiments. It is apparent that the described embodiments are only a part of the embodiments of the invention, and not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the scope of the present invention.

First embodiment

Please refer to FIG. 1 , which is a system architecture diagram of a voice recognition system 100 according to a first embodiment of the present invention. In this embodiment, the voice recognition system 100 is implemented by the client device 200 and the cloud server 300 to complete the whole process of identifying the front end, model training, and identifying the back end through the cloud server 300, and the final voice recognition result. Delivered to the client device 200. In this way, the data processing capacity of the client device 200 can be alleviated, and the deployment is very convenient, and most of the work of the subsequent upgrade is also completed in the cloud server 300.

Specifically, referring to FIG. 2, the voice recognition system 100 includes at least a voice input module 10, a feature extraction module 20, a model training module 30, and an update module 40. In this embodiment, the voice input module 10 is disposed on the client device 200, such as a microphone and its processing circuit. The feature extraction module 20, the model training module 30, the update module 40, and the like are integrated in the cloud server 300.

The voice input module 10 is configured to input the voice of the user in real time when the client device 200 enables the real-time call or voice input function. The client device 200 can be a mobile phone, an in-vehicle device, a computer, a mobile phone, a smart home device, a wearable device, and the like. The user's voice can also be saved locally or saved in the cloud.

The feature extraction module 20 is configured to extract a voice feature from the input user voice. In this embodiment, the feature extraction module 20 saves the extracted voice features in a first voice database 21 in real time, and the first voice database 21 may be a local database or a cloud database. The speech feature refers to feature data of the user's voice.

The model training module 30 is configured to establish a corresponding acoustic and language model according to the voice feature and a preset rule, so as to match the extracted voice feature with the acoustic and language models in a subsequent recognition process. Compare and get the best recognition results. In this embodiment, the preset rule is in Dynamic Time Warping (DTW), Hidden Markov Model (HMM) theory, and Vector Quantization (VQ) technology. In addition, in this embodiment, the model training module 30 periodically extracts the voice features from the first voice database 21 for model training. Of course, in other embodiments, the The model training module 30 can also extract specific speech features in the first speech database 21 in real time for real-time model training, or quantitative (eg, 100) extraction of the specific speech features, and the present invention is not limited by these embodiments.

The update module 40 is configured to save and update the acoustic and language models into a model database 41 in real time, whereby a larger acoustic and language model database 41 can be acquired, which improves the degree of recognition.

In addition, in order to be able to keep the user's voice information confidential and provide personalized model training for different user voice features, the cloud server 300 includes a plurality of private cloud master modules corresponding to different users, and each private cloud master module The feature extraction module 20, the model training module 30, the update module 40, and the like are included. The specific voice feature extracted by the feature extraction module 20 is saved under the corresponding private cloud module. At the same time, the model training module 30 performs acoustic and language model training on the specific speech features and updates the model through the update module 40. When the user activates the voice recognition system 100, the voice recognition function can be enabled by means of account authentication.

It can be understood that in other embodiments, the voice recognition system 100 can also be integrated in a client device 200, such as an in-vehicle device, a computer, a mobile phone, a smart home device, a wearable device, etc., for The user turns on offline voice recognition. At this time, the first voice database 21 and the model database 41 are both local databases. In this way, the above voice recognition function can be implemented without a network connection.

In general, in traditional speech recognition technology, the mobile phone is usually not recorded in real time or recorded by a pad (may be other device) during recording, or saved as a speech model training. Sample. The present invention continuously records and stores the real-time call and recording information as a sample of the speech model training, so that the model database 41 can be continuously updated according to different pronunciation characteristics of the user. Thereby, the user's individual needs can be satisfied, and a variety of voices, such as English or local dialects, can be supported, and the recognition degree is improved. In addition, the present invention also provides a private cloud master module for different users, for the user to enable the voice recognition function by means of account authentication, thereby improving the privacy performance of the user voice information.

Second embodiment

Referring to FIG. 3, the speech recognition system 100a according to the second embodiment of the present invention is substantially the same as the speech recognition system 100 of the first embodiment, except that the speech recognition system 100a further includes an identification module 50 for the identification. The module 50 is configured to determine whether the voice feature can be identified according to the acoustic and language models in the model database 41a, and if not, generate a recognition result carrying the control command; otherwise, store other unrecognizable voice features to In the first voice database 21a. At this time, the first voice database 21a only needs to save the voice features that are not recognized, which saves space. The model training module 30 further includes a manual labeling unit 31, configured to manually map the unrecognizable voice feature with the matching degree lower than the threshold to the preset standard voice according to a user command, and The speech features and the standard speech data and their mapping relationship are updated in a second speech database 33 for use by the recognition module 50. Correspondingly, the identification module 50 is further configured to identify the voice data and output the recognition result according to the currently input user voice data and the second voice database 33.

More specifically, the identification module 50 includes a first decoding unit 51 and a second decoding unit 52, and the first decoding unit 51 is configured to perform a matching degree calculation on the currently extracted voice feature and the acoustic and language models. If the matching degree is greater than or equal to the threshold, it is judged that the corresponding voice feature can be identified and the recognition result is output, otherwise, the voice feature is judged not to be recognized. The second decoding unit 52 is configured to identify the voice of the user according to the currently input user voice and the second voice database 33, and output a corresponding standard voice.

In this embodiment, the manual labeling unit 31 includes a prompting subunit 311, a selecting subunit 313, an input subunit 315, and an confirming subunit 317. The prompting sub-unit 311 is configured to periodically prompt the user to view the unrecognizable voice features stored in the first voice database 21. The selection sub-unit 313 is configured to allow a user to select a standard voice corresponding to the unrecognizable voice feature, wherein the standard voice is pre-stored in the first voice database 21. For example, the user can listen to the unknown Other specific speech, and then standard speech matching the speech feature is selected based on the provided standard speech. The input subunit 315 is configured to allow a user to input a standard voice corresponding to the unrecognizable voice feature. It can be understood that only one of the selection sub-unit 313 and the input sub-unit 315 can be selected for setting. When there is no corresponding option in the standard speech, the corresponding standard speech can be determined by means of voice input. . The confirmation subunit 317 is configured to allow a user to confirm a mapping relationship between the voice feature and the standard voice, and store the mapping relationship in the second voice database 33 after the confirmation is completed.

In the second embodiment, the feature extraction module 20, the model training module 30, the update module 40, and the identification module 50 are integrated in the cloud server 300a, and the identification module 50 respectively identifies voice data under different cloud modules. .

The speech recognition system 100a provided by the second embodiment only performs remodeling training on unrecognizable speech data, which can reduce data redundancy and improve recognition speed and efficiency.

In addition, the voice recognition system 100a (or 100) may further include an execution module 60, configured to generate a text in a specific format or play a corresponding standard voice according to the recognition result, and control a corresponding client according to the control command. device. In order to be able to run the speech recognition system 100a in different client devices 200, the speech recognition system 100a may further comprise a download module 70 for the user to update the acoustics and language in the corresponding private cloud module. The model is downloaded locally to implement speech recognition locally.

It can be understood that, in other embodiments, the identification module 50 may also store all the voice features in the first voice database 21 for the model training module 30 while identifying the voice features. The speech feature is extracted from the first speech database 21 at a timing to perform model training.

Referring to FIG. 4, an embodiment of the present invention provides a voice recognition method, where the method includes the following steps:

Step S401, inputting the voice of the user in real time based on enabling the real-time call or voice input function. Specifically, the real-time call or voice input function is implemented by using a mobile phone, an in-vehicle device, a computer, a mobile phone, a smart home device, a wearable device, and the like. At the same time, the user's voice can also be saved in real time for subsequent calls.

Step S403, extracting a voice feature from the input user voice. In this embodiment, the extracted voice features are saved in a first voice database 21 in real time. Wherein the first voice database 21 may be a local database or a cloud database, and the voice feature refers to feature data of the user voice.

Step S405, establishing corresponding acoustic and language models according to the voice features and preset rules, for matching and comparing the extracted voice features with the acoustic and language models in the subsequent recognition process to obtain the best Identification result.

In step S407, the acoustic and language models are saved and updated in real time into a model database 41, whereby a larger acoustic and language model database 41 can be acquired, and the degree of recognition is improved.

In this embodiment, step S401 is performed on the client device, for example, by using a microphone and its processing circuit for voice input. The step S403, the step S405, and the step S407 are performed in the cloud server 300. In order to be able to keep the user's voice information confidential and provide personalized model training for different user voice features, the cloud server further includes multiple private cloud accounts corresponding to different users, and each private cloud master account may be separately Steps S403 to S407 are performed, and when the user enables the voice recognition function, the method can be performed by using an account authentication method.

It can be understood that, in other embodiments, the steps S401-S407 can be performed on the client device 200, and the first voice database 21 and the model database 41 are local databases.

Referring to FIG. 5, in another embodiment, in addition to the foregoing steps S401-S407, the voice recognition method further includes:

In step S409, it is determined whether the voice feature can be identified according to the acoustic and language models in the model database 41. If the voice feature can be identified, step S411 is executed to generate a recognition result of the carry control command. Otherwise, step S413 is performed, and the process cannot be performed. The other recognized speech features are stored in the first speech database 21.

Specifically, referring to FIG. 6, the step S409 includes the following sub-steps:

Sub-step S409a, performing matching degree calculation on the voice feature with the acoustic and language models. If the matching degree is greater than or equal to the threshold, performing sub-step S409b, determining that the corresponding voice feature can be identified and outputting the recognition result; otherwise, Performing sub-step S409c, it is determined that the voice feature cannot be recognized.

Sub-step S409d, manually, according to a user command, manually mapping the unrecognizable voice feature whose matching degree is lower than the threshold value with a preset standard voice, and mapping the voice feature to the standard voice data and a mapping relationship thereof The update is in a second voice database 33.

At this time, the first voice database 21 only stores the voice features that are not recognized, so the voice recognition system 100 only needs to perform model training on the unrecognizable voice data, which can reduce data. Redundancy increases recognition speed and efficiency.

Referring to FIG. 7, in another embodiment, in combination with steps S401-S413, the method further includes:

Step S415, generating text in a specific format or playing a corresponding standard voice according to the recognition result, and controlling a corresponding client device according to the control command;

Step S417, downloading the updated acoustic and language models in the corresponding private cloud module to the local to implement voice recognition locally.

Moreover, in other embodiments, while the voice features are identified, all of the voice features may also be stored in the first voice database 21, from the first voice database in a timed, real-time or quantitative manner. 21 extracting the speech features to perform model training.

The speech recognition system and method of the present invention records or saves real-time call and recording information in real time, and as a sample of speech model training, thereby continuously updating the model database 41 according to different pronunciation characteristics of the user. Thereby, the user's individual needs can be satisfied, and a variety of voices, such as English or local dialects, can be supported, and the recognition degree is improved. In addition, the present invention also provides a private cloud master module (account) for different users, for the user to enable the voice recognition function by means of account authentication, thereby improving the privacy performance of the user voice information.

It should be noted that, through the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary hardware platform, and of course, all can be implemented by hardware. Based on such understanding, all or part of the technical solution of the present invention contributing to the background art may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, an optical disk, or the like. A number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform the methods described in various embodiments of the present invention or in some portions of the embodiments.

The above are only the preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, and the equivalent changes made by the claims of the present invention are still within the scope of the present invention.

Claims

A speech recognition system, characterized in that the system comprises at least:

a voice input module for inputting a user's voice in real time when real-time call or voice input function is enabled;

a feature extraction module, configured to extract a voice feature from the input user voice;

a model training module, configured to establish a corresponding acoustic and language model according to the voice feature and a preset rule;

An update module for saving and updating the acoustic and language models into a model database.
The speech recognition system according to claim 1, wherein the feature extraction module saves the extracted speech features in a first speech database in real time, and the model training module periodically or quantitatively derives from the first speech. The speech features are extracted from the database for model training.
The speech recognition system according to claim 2, wherein the feature extraction module, the model training module, and the update module are integrated in a cloud server, and the cloud server includes a plurality of private cloud modules corresponding to different users. The specific voice feature extracted by the feature extraction module is saved to the corresponding private cloud module, and the model and update are established by the model training module and the update module, and the recognition module respectively identifies the voice data under different cloud modules.
The speech recognition system of claim 1 further comprising:

An identification module, configured to determine, according to the acoustic and language models in the model database, whether the voice feature can be identified, and if not, generate a recognition result carrying a control command; otherwise, store other unrecognizable voice features to In a first voice database, the model training module is re-trained.
A speech recognition system according to claim 4, comprising at least:

a first decoding unit, configured to perform a matching degree calculation on the voice feature and the acoustic and language model, and if the matching degree is greater than or equal to the threshold, determine that the corresponding voice feature can be identified and output the recognition result; otherwise, the determination cannot be Identifying the speech feature;

The model training module further includes a manual labeling unit, configured to manually map the unrecognizable voice features with the matching degree lower than the threshold to the preset standard voice according to a user command, and match the voice The feature and the standard voice data and their mapping relationship are stored in a second voice database.
A speech recognition system according to claim 5, wherein said manual labeling unit package include:

a prompting subunit, configured to periodically prompt the user to view unrecognized voice features stored in the first voice database;

Selecting a subunit for the user to select a standard voice corresponding to the unrecognizable voice feature, wherein the standard voice is pre-stored in the first voice database; and/or

An input subunit for the user to input a standard voice corresponding to the unrecognizable voice feature;

The confirmation subunit is configured to allow a user to confirm a mapping relationship between the unrecognizable voice feature and the standard voice, and store the data in the second voice database.
The speech recognition system according to claim 5, wherein the identification module further comprises a second decoding unit, configured to identify the voice of the user according to the currently input user voice and the second voice database, and Output the corresponding standard voice.
The speech recognition system according to claim 4, wherein said identification module stores said speech feature in said first speech database while said speech feature is being recognized, for the model training module to The first speech database extracts the speech features to perform model training.
The speech recognition system according to claim 4, wherein the functions of the feature extraction module, the model training module, the update module, and the identification module are respectively implemented by each private cloud module of a cloud server, wherein each private cloud module Corresponding to a user, the specific voice feature extracted by the feature extraction module is saved under the corresponding private cloud module.
The speech recognition system of claim 1 further comprising:

The download module is configured for the user to download the acoustic and language models in the corresponding private cloud module to the local to implement voice recognition locally.
A speech recognition method comprising:

Enter the user's voice in real time based on enabling real-time call or voice entry;

Extracting voice features from the input user voice;

Corresponding acoustic and language models are established according to the voice features and preset rules;

The acoustic and language models are saved and updated in real time into a model database.
A client device comprising the speech recognition system according to any one of claims 1 to 9.
A cloud server including a plurality of private cloud main modules corresponding to different users, each cloud The main module includes:

a feature extraction module, configured to extract a voice feature from a user voice input from a client device that is enabling real-time call or voice input function;

a model training module, configured to establish a corresponding acoustic and language model according to the voice feature and a preset rule;

An update module for saving and updating the acoustic and language models into a model database.