WO2017206661A1 - Voice recognition method and system - Google Patents

Voice recognition method and system

Info

Publication number
WO2017206661A1
WO2017206661A1 (application PCT/CN2017/083065)
Authority
WO
WIPO (PCT)
Prior art keywords
sample feature
sample
terminal
speech recognition
cloud server
Prior art date
Application number
PCT/CN2017/083065
Other languages
French (fr)
Chinese (zh)
Inventor
许永昌
盛阁
Original Assignee
深圳市鼎盛智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市鼎盛智能科技有限公司
Publication of WO2017206661A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present invention relates to the field of voice interaction, and in particular to a method and system for voice recognition.
  • In order to let users obtain information conveniently, most intelligent hardware manufacturers provide human-computer interaction methods such as voice interaction.
  • During voice interaction, the intelligent hardware acquires the voice information input by the user and then, through voice recognition, outputs corresponding information or executes corresponding instructions.
  • When voice recognition is inaccurate, the intelligent hardware cannot output the correct information or execute the correct instructions, which degrades the user experience. Therefore, improving the accuracy of speech recognition during voice interaction is an urgent problem to be solved.
  • the main object of the present invention is to provide a method and system for speech recognition, which aims to improve the accuracy of speech recognition in a speech interaction process.
  • a method for voice recognition includes the following steps:
  • the terminal acquires an input voice
  • the terminal extracts a sample feature of the input voice
  • the terminal identifies the input speech according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the set of speech recognition output values includes a correspondence between a sample feature of the speech and a speech recognition output value
  • the identifying, by the terminal, the input voice according to the sample feature and a preset local database includes:
  • the terminal searches the set of speech recognition output values to determine whether there is a speech recognition output value corresponding to the sample feature;
  • when the speech recognition output value corresponding to the sample feature is found, the terminal acquires that speech recognition output value;
  • when the speech recognition output value corresponding to the sample feature is not found, the terminal performs voice recognition according to the detected network signal strength of the terminal.
  • the performing, by the terminal, of voice recognition according to the detected network signal strength of the terminal comprises:
  • the terminal determines whether the network signal strength is greater than a first preset value
  • when the network signal strength is greater than the first preset value, the terminal sends the sample feature to the terminal's own cloud server, and voice recognition is performed through the own cloud server;
  • when the network signal strength is not greater than the first preset value, the terminal inputs the sample feature into the speech recognition model and outputs the predicted recognition result.
  • the method further includes:
  • after the own cloud server receives the sample feature, a search is performed in the own cloud server; when a recognition result corresponding to the sample feature is found, the own cloud server acquires the recognition result;
  • when no recognition result corresponding to the sample feature is found and the own cloud server detects that the network strength is greater than a second preset value, the own cloud server sends the sample feature to a third-party voice server;
  • the input voice is then identified through the third-party voice recognition server.
  • before the terminal searches the set of speech recognition output values for a speech recognition output value corresponding to the sample feature, the method further comprises:
  • the terminal compares the sample features with a preset sample library
  • when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, the terminal performs the step of searching in the local database according to the sample feature;
  • when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, the terminal performs the step of sending the sample feature to the terminal's own cloud server and carrying out voice recognition through the own cloud server.
  • the present invention further provides a system for voice recognition, the system comprising: a terminal;
  • the terminal includes:
  • a feature extraction module configured to extract sample features of the input voice
  • a voice recognition module configured to identify the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the set of speech recognition output values includes a correspondence between a sample feature of the speech and a speech recognition output value
  • the speech recognition module includes:
  • a first identification submodule configured to acquire the speech recognition output value when the speech recognition output value corresponding to the sample feature is found;
  • a second identification submodule configured to perform voice recognition according to the detected network signal strength of the terminal when the speech recognition output value corresponding to the sample feature is not found.
  • the system further includes an own cloud server corresponding to the terminal;
  • the second identification submodule includes:
  • a determining unit configured to determine whether the network signal strength is greater than a first preset value
  • a first identifying unit configured to send the sample feature to the own cloud server when the network signal strength is greater than a first preset value, and perform voice recognition by using the own cloud server;
  • a second identifying unit configured to: when the network signal strength is less than the first preset value, input the sample feature to the voice recognition model, and output the predicted recognition result.
  • the own cloud server comprises:
  • a search module configured to perform a search in the own cloud server after receiving the sample feature by the own cloud server;
  • An identification module configured to acquire the recognition result when searching for a recognition result corresponding to the sample feature in the own cloud server
  • a sending module configured to, when no recognition result corresponding to the sample feature is found in the own cloud server and the network strength is greater than a second preset value, send the sample feature to a third-party voice server, the input voice being identified through the third-party voice recognition server.
  • the terminal further includes:
  • a comparison analysis module configured to compare the sample features with a preset sample library
  • a first triggering module configured to trigger the search submodule to search in the local database according to the sample feature when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value;
  • the first identifying unit is further configured to: when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, send the sample feature to the own cloud server, and perform voice recognition through the own cloud server.
  • In an embodiment of the present invention, the terminal acquires an input voice; the terminal extracts a sample feature of the input voice; and the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model. Because the local database includes the deep-learning-based speech recognition model and the output values obtained from it, the recognition result is more accurate when the input speech is recognized through the local database, thereby achieving the purpose of improving the accuracy of speech recognition during voice interaction.
  • FIG. 1 is a schematic flow chart of steps of a first embodiment of a method for voice recognition according to the present invention
  • FIG. 2 is a schematic diagram of obtaining a certain sound through a basic sound structure in the embodiment shown in FIG. 1;
  • FIG. 3 is a schematic diagram showing a target sound by sparse coding in the embodiment shown in FIG. 1;
  • FIG. 4 is a schematic flowchart of the refinement of step S30 in the embodiment shown in FIG. 1;
  • FIG. 5 is a schematic flowchart of a refinement step of performing voice recognition according to the detected network signal strength of the terminal in step S330 in the embodiment shown in FIG. 4;
  • FIG. 6 is a schematic flowchart of a refinement step of performing voice recognition in a self-owned cloud server according to the present invention
  • FIG. 7 is a schematic structural diagram of a self-owned cloud server in the embodiment shown in FIG. 6 according to the present invention.
  • FIG. 8 is a schematic diagram of functional modules of a first embodiment of a voice recognition system according to the present invention.
  • FIG. 9 is a schematic diagram of a refinement function module of the speech recognition module 30 in the embodiment shown in FIG. 8;
  • FIG. 10 is a schematic diagram of a refinement function module of the second identification sub-module 330 in the embodiment shown in FIG. 9;
  • FIG. 11 is a schematic diagram of functional modules included in the own cloud server 12 of the present invention.
  • the present invention provides a method of speech recognition.
  • the method includes:
  • Step S10 The terminal acquires an input voice.
  • Step S20 the terminal extracts sample features of the input voice
  • Step S30: the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the method for voice recognition provided by the present invention is for identifying an input voice in the case of voice interaction.
  • In voice interaction, the terminal generally receives the voice input by the user through its voice input device and processes the received voice, then either converts the input voice into text output or, after the input voice is recognized, controls the operation of the terminal through control commands.
  • the terminal can be understood as a carrier for receiving sound input, and the terminal can be various devices with voice interaction functions such as a mobile phone, a tablet, a smart TV, a smart air conditioner, and an intelligent robot.
  • the input voice is a voice input by a user during a voice interaction.
  • After the input voice is acquired, it is processed. Specifically, the acquired input voice exists in the form of sound data; spectrum analysis is then performed on the sound data, and the extracted sample features are stored in the terminal.
  • Spectral analysis means that the signal is Fourier transformed to obtain its amplitude spectrum and phase spectrum. There are many methods for spectrum analysis, which can be selected according to needs.
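As an illustration of the spectrum analysis described above, the sketch below computes the amplitude and phase spectrum of a short signal with a naive discrete Fourier transform. This is a minimal pure-Python sketch; a real system would use an FFT library routine, and the test signal here is invented for the example:

```python
import cmath
import math

def spectrum(signal):
    """Naive DFT: return (amplitude_spectrum, phase_spectrum) of a signal."""
    n = len(signal)
    amp, phase = [], []
    for k in range(n):
        # X[k] = sum_t x[t] * e^(-2*pi*i*k*t/n)
        xk = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                 for t in range(n))
        amp.append(abs(xk))
        phase.append(cmath.phase(xk))
    return amp, phase

# A pure tone completing one cycle over 8 samples: the amplitude spectrum
# peaks at bin 1 (and its mirror image, bin 7).
tone = [math.sin(2 * math.pi * t / 8) for t in range(8)]
amp, phase = spectrum(tone)
peak = max(range(len(amp)), key=lambda k: amp[k])
```

The amplitude spectrum concentrates the tone's energy at its frequency bin, which is the kind of feature a subsequent extraction step would work from.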
  • Feature extraction of the input speech further analyzes the speech. Methods for feature extraction are prior art, may be selected as needed, and are not described here.
  • The preset local database is a database existing on the terminal. Even when the terminal is not networked, the database can be accessed directly to obtain its information; the data stored in the local database can be understood as a secondary cache for storing sound data.
  • Deep learning mainly uses structures similar to artificial neural networks. Artificial neural networks have a hierarchical, progressive structure in which high-level representations are composed of low-level representations, building up levels from shallow to deep. The essence of deep learning is to learn important features by constructing various learning models and massive training data, so as to improve prediction accuracy.
  • Various sounds are collected and their features extracted; the collected sounds serve as the training set, which is used to improve the model's prediction accuracy through continuous learning.
  • The training process is the process of optimizing the model weights. After the model has been optimized by training on the samples, an input sound fed into the model yields an output value, which is the predicted value for identifying the input sound.
  • Sparse coding represents a signal as a linear combination of a set of bases, requiring only a few bases for the representation.
  • For example, 20 basic sound structures can be found among various disordered sounds, and other sounds can be synthesized from these 20 structures.
  • In FIG. 2, the left side represents the 20 basic sound structures, and the right side represents a certain sound synthesized from them; the target synthesized sound is determined by the weight values given to the 20 basic sounds during composition.
  • That is, the weight coefficients are a[k], and the basic sound structures are S[k]; the synthesized sound is the weighted sum of the S[k] with coefficients a[k].
  • the sample set with different pitch, timbre and volume characteristics can be constructed by sparse coding, and then the sample set is trained by the preset training algorithm to optimize the network weight of the speech recognition model.
  • Regarding the preset training algorithm: there are many commonly used deep-learning-based training algorithms, which can be selected as needed.
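The sparse-coding synthesis above (target sound = weighted sum of basic sound structures, with few non-zero weights) can be sketched as follows. The basis waveforms and weights are invented for illustration; a real system would learn S[k] from recorded audio:

```python
import random

random.seed(0)

# Hypothetical "basic sound structures" S[k]: 20 short basis waveforms.
# A real system would learn these from recorded audio.
N_BASES, LENGTH = 20, 16
S = [[random.uniform(-1.0, 1.0) for _ in range(LENGTH)] for _ in range(N_BASES)]

def synthesize(weights, bases):
    """Target sound = sum over k of a[k] * S[k] (a weighted sum of bases)."""
    length = len(bases[0])
    return [sum(a * basis[t] for a, basis in zip(weights, bases))
            for t in range(length)]

# A sparse code: only 3 of the 20 weight coefficients a[k] are non-zero.
a = [0.0] * N_BASES
a[2], a[7], a[13] = 0.8, -0.5, 0.3
target = synthesize(a, S)
```

Varying the weights over such a basis yields sample sets with different pitch, timbre, and volume characteristics, as the surrounding text describes.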
  • The set of speech recognition output values obtained according to the speech recognition model refers to the set of output values obtained by passing a number of pre-entered speech samples, after feature extraction, through the speech recognition model; these output values represent the recognition results of the input speech and can be retrieved by searching at recognition time.
  • In an embodiment of the present invention, the terminal acquires an input voice; the terminal extracts a sample feature of the input voice; and the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model. Because the local database includes the deep-learning-based speech recognition model and the output values obtained from it, the recognition result is more accurate when the input speech is recognized through the local database, thereby achieving the purpose of improving the accuracy of speech recognition during voice interaction.
  • step S30 includes:
  • Step S310: the terminal searches the set of speech recognition output values for the speech recognition output value corresponding to the sample feature; if it is found, step S320 is performed; if not, step S330 is performed.
  • Step S320: the terminal acquires the speech recognition output value.
  • Step S330: the terminal performs voice recognition according to the detected network signal strength of the terminal.
  • After the sample features of a set of speech are processed by the speech recognition model, the correspondence between the sample features and the speech recognition output values is included in the speech recognition output value set, which is then used for speech recognition.
  • The sample feature is searched for in the set of speech recognition output values; a preset search engine is used to determine whether a speech recognition output value corresponding to the sample feature exists.
  • When the speech recognition output value corresponding to the sample feature is found, that value is acquired; this speech recognition output value is the recognition result of the input voice, and the recognition result may then be output at the terminal, or a corresponding operation may be performed according to it.
  • For example, if the voice interaction is meant to issue instructions to an intelligent robot, the robot is controlled to perform the corresponding operation; if the voice interaction is meant to retrieve content in a browser, the corresponding retrieval is performed based on the recognized result, and the retrieval result is displayed on the user terminal.
  • When the speech recognition output value corresponding to the sample feature is not found, voice recognition is performed according to the detected network signal strength of the terminal.
  • Many prior-art methods exist for detecting the strength of the terminal's network signal; one may be selected as needed and is not described again here.
  • The purpose of judging the terminal's network signal strength is to determine the current network environment; the next step is chosen according to that environment.
  • In this embodiment, the terminal searches the set of speech recognition output values for the value corresponding to the sample feature, and when that value is retrieved, it is used as the result, thereby improving the accuracy of recognition.
  • When the speech recognition output value corresponding to the sample feature is not retrieved, speech recognition is performed according to the terminal's network signal strength; this avoids attempting to send the sample feature to other servers, or waiting for their responses, when the network signal is weak, thereby improving the speed of speech recognition.
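A minimal sketch of the secondary-cache lookup of steps S310 through S330: the sample-feature keys and output values below are invented for the example, since the real set would be produced by the deep-learning model rather than written by hand:

```python
# Hypothetical secondary cache: sample feature (as a hashable key) mapped to
# its speech recognition output value. Contents are illustrative only.
local_output_values = {
    ("ni", "hao"): "hello",
    ("kai", "deng"): "turn on the light",
}

def recognize_locally(sample_feature):
    """Steps S310/S320: return (output_value, True) on a cache hit, or
    (None, False) so the caller falls back to the network-strength path (S330)."""
    if sample_feature in local_output_values:
        return local_output_values[sample_feature], True
    return None, False

result, hit = recognize_locally(("kai", "deng"))
missing, found = recognize_locally(("bu", "cun", "zai"))
```

The miss case carries no result; it only signals that recognition must proceed according to network signal strength.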
  • As shown in FIG. 5, a schematic flowchart of the refinement of step S330, performing voice recognition according to the detected network signal strength of the terminal includes:
  • Step S331 the terminal determines whether the network signal strength is greater than the first preset value; if yes, step S332 is performed; if not, step S333 is performed;
  • Step S332 the terminal sends the sample feature to the own cloud server of the terminal, and performs voice recognition through the own cloud server.
  • Step S333 the terminal inputs the sample feature to the voice recognition model, and outputs the predicted recognition result.
  • This embodiment refines the terminal's voice recognition according to network signal strength. Specifically, it is determined whether the network signal strength is greater than a preset value; the preset value may be set as required and may be fixed or variable.
  • the network signal strength is greater than the first preset value, it indicates that the network environment of the terminal is good at this time.
  • the sample feature is sent to the own cloud server of the terminal, and the voice recognition is performed by the own cloud server.
  • the above-mentioned own cloud server refers to the network side cloud server of the terminal, and the data existing in the own cloud server can be understood as the level 1 cache.
  • When the network signal strength is not greater than the first preset value, the predicted recognition result is obtained directly from the speech recognition model in the local database. It can be understood that if the result output by the speech recognition model includes predicted values and confidences, the result with the highest confidence can be confirmed as the speech recognition result.
  • In this embodiment, the terminal determines whether the network signal strength is greater than the first preset value; if so, it sends the sample feature to the terminal's own cloud server and searches there, and when the network signal is poor it uses the recognition result predicted by the speech recognition model, avoiding the corresponding delay and improving recognition accuracy. Combined with searching first in the local database and then in the own cloud server, the speed of speech recognition is also improved.
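The signal-strength routing of steps S331 through S333 can be sketched as follows. The threshold value, the stand-in cloud lookup, and the toy (value, confidence) model outputs are all assumptions made for illustration:

```python
FIRST_PRESET = 50  # hypothetical threshold on a 0-100 signal-strength scale

def route_recognition(signal_strength, sample_feature, cloud_lookup, local_model):
    """Steps S331-S333: with a strong signal, recognize via the own cloud
    server; otherwise use the local deep-learning model and keep the
    highest-confidence prediction."""
    if signal_strength > FIRST_PRESET:
        return cloud_lookup(sample_feature)            # step S332
    predictions = local_model(sample_feature)          # step S333
    return max(predictions, key=lambda p: p[1])[0]

# Stand-in cloud lookup and local model, invented for the example.
cloud = lambda feature: "cloud:" + feature
model = lambda feature: [("turn on", 0.91), ("turn off", 0.06), ("tune in", 0.03)]

strong = route_recognition(80, "feat", cloud, model)  # strong signal
weak = route_recognition(20, "feat", cloud, model)    # weak signal
```

Taking the maximum over (value, confidence) pairs mirrors the text's note that the highest-confidence prediction is confirmed as the result.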
  • FIG. 6 is a schematic flowchart of the refinement steps for voice recognition in the own cloud server; in this embodiment, the method for voice recognition provided by the present invention further includes:
  • Step S101: after the sample feature is received by the own cloud server, a search is performed in the own cloud server.
  • Step S102: when the recognition result corresponding to the sample feature is found in the own cloud server, the own cloud server acquires the recognition result;
  • Step S103: when no recognition result corresponding to the sample feature is found in the own cloud server and the own cloud server detects that the network strength is greater than a second preset value, the own cloud server sends the sample feature to a third-party voice server, and the input voice is identified through the third-party voice recognition server.
  • The search is performed in the own cloud server, and when the recognition result corresponding to the sample feature is found, the recognition result is obtained.
  • the self-owned cloud server is deployed in the cloud and usually has multiple distributed cache servers with more computing power.
  • According to usage, the most frequently used speech recognition output values can be stored in the local database, while slightly less frequently used values are stored in the own cloud server. Understandably, the data stored in the local database and in the own cloud server are continuously updated with use, so that speech recognition becomes more accurate and faster.
  • After recognition, the recognition result can be saved to the own cloud server and/or the local database, and deep learning is performed on the recognition result, so that the prediction accuracy of the speech recognition model improves as the number of uses increases.
  • When no recognition result is found in the own cloud server, the own cloud server sends the sample feature to the third-party voice server.
  • The second preset value may be set as needed; it may be the same as or different from the first preset value, because the network speed required for accessing the own cloud server and for accessing the third-party server may differ.
  • the third-party voice server is a server with stronger voice recognition capability.
  • The third-party voice server can be a server provided by a vendor of voice recognition services, such as a voice recognition cloud server provided by the University of Science and Technology.
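Steps S101 through S103 on the own cloud server (search the level-1 cache, then conditionally forward to the third-party server) might look like the sketch below; the cache contents, the threshold, and the third-party stub are all hypothetical:

```python
SECOND_PRESET = 60  # hypothetical threshold for contacting the third-party server

# Level-1 cache on the own cloud server (contents invented for the example).
own_cloud_cache = {"feat-a": "play music"}

def cloud_recognize(sample_feature, network_strength, third_party):
    """Steps S101-S103: search the own cloud server first; on a miss, forward
    the sample feature to the third-party voice server only when the network
    is strong enough."""
    if sample_feature in own_cloud_cache:
        return own_cloud_cache[sample_feature]         # step S102
    if network_strength > SECOND_PRESET:
        return third_party(sample_feature)             # step S103
    return None  # weak network: avoid the third-party round trip

third_party_stub = lambda feature: "3p:" + feature
hit = cloud_recognize("feat-a", 10, third_party_stub)
forwarded = cloud_recognize("feat-b", 90, third_party_stub)
skipped = cloud_recognize("feat-b", 10, third_party_stub)
```

Skipping the third-party round trip on a weak network matches the text's point about reducing response delay.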
  • Classification analysis can also be performed according to characteristics such as the user's age and identity, and a corresponding database established, to make the recognition result more accurate.
  • the architecture of the own cloud server is as shown in Figure 7.
  • Deploying CDN servers reduces the difference in access speed between regions.
  • The CDN server is also responsible for returning data found in its cache; users access the CDN server first.
  • After a request reaches the reverse proxy server, it is sent to the application server through the load-balancing server.
  • The load-balancing server accommodates concurrent access by a large number of users, offloads data, and improves stability. A local cache can also be added on the application server to respond quickly with recognition results based on historical recognitions.
  • the database server can also be set to store the accounts and settings of a large number of users.
  • An interface connects to the third-party voice recognition server, combining the third-party server's recognition capability to improve recognition accuracy and the user experience.
  • In this embodiment, after the own cloud server receives the sample feature, a search is performed in the own cloud server; when a recognition result corresponding to the sample feature is found, the recognition result is acquired.
  • When no recognition result is found, the own cloud server sends the sample feature to the third-party voice server, and identifying it on the third-party voice server improves the accuracy of speech recognition.
  • Sending the sample features to the third-party voice server only when the network environment is good reduces the response delay in the speech recognition process.
  • Before step S310, the method includes:
  • the terminal compares the sample features with a preset sample library;
  • when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, step S310 is performed;
  • when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, step S332 is performed.
  • The sample features are compared with the preset sample library to determine whether to search directly in the local database or to send the sample features directly to the own cloud server for searching.
  • The preset sample library can be configured as needed. Specifically, the sample features are matched against the sample features in the sample library; the preset samples are the samples preset in the sample library.
  • When the similarity between the sample feature and a preset sample feature in the sample library is higher than the preset value, step S310 is performed, that is, the terminal identifies the input voice according to the sample feature and the local database preset in the terminal; here the matching preset sample is a sample in the sample library whose similarity to the sample feature is greater than the preset value.
  • When the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, step S332 is performed, that is, the terminal sends the sample feature to the terminal's own cloud server, and speech recognition is performed through the own cloud server.
  • Every sample feature matching below the preset value means that the degree of match between the sample feature and the sample library is less than the preset value.
  • The specific preset value may be set as needed; for example, it may be set to 80%. When the similarity between the sample feature and a preset sample feature in the sample library is higher than 80%, step S310 is performed; when the similarity between the sample feature and every preset sample feature in the sample library is lower than 80%, step S332 is performed.
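The 80% similarity gate described above can be sketched with a simple cosine similarity over feature vectors. The patent does not specify the similarity measure, so cosine similarity and the toy vectors below are assumptions for illustration:

```python
import math

PRESET = 0.80  # the 80% similarity threshold used in the example above

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def choose_path(feature, sample_library):
    """Return 'local' (step S310) when some library sample is similar enough,
    otherwise 'own_cloud' (step S332)."""
    best = max(cosine_similarity(feature, sample) for sample in sample_library)
    return "local" if best > PRESET else "own_cloud"

# Toy 3-dimensional features; real sample features would be acoustic vectors.
library = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near = choose_path([0.95, 0.05, 0.0], library)  # close to the first sample
far = choose_path([0.5, 0.5, 0.7], library)     # matches nothing well
```

The gate only decides which tier performs recognition; it never produces a recognition result itself.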
  • In this embodiment, when the similarity is high enough, the search is performed in the local database according to the sample features; when the similarity between the sample feature and every sample feature in the sample library is lower than the preset value, the sample feature is sent directly to the own cloud server for matching, improving the speed of voice recognition while ensuring its accuracy.
  • the present invention also provides a system for voice recognition.
  • a first embodiment of a system for voice recognition according to the present invention is provided.
  • The system for voice recognition includes a terminal 11. The terminal includes:
  • an obtaining module 10 configured to acquire an input voice;
  • a feature extraction module 20 configured to extract sample features of the input voice;
  • a voice recognition module 30 configured to identify the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the system for voice recognition is for identifying an input voice in the case of voice interaction.
  • In voice interaction, the terminal generally receives the voice input by the user through its voice input device and processes the received voice, then either converts the input voice into text output or, after the input voice is recognized, controls the operation of the terminal through control commands.
  • the system for voice recognition includes a terminal, and the terminal can be understood as a carrier for receiving voice input, and the terminal can be various devices with voice interaction functions such as a mobile phone, a tablet, a smart TV, a smart air conditioner, and an intelligent robot.
  • the input voice is a voice input by a user during a voice interaction.
  • After acquiring the input voice, the feature extraction module 20 processes the voice. Specifically, the acquired input voice exists in the form of sound data; spectrum analysis is performed on the sound data, and the extracted sample features are stored in the terminal. Spectral analysis means that the signal is Fourier transformed to obtain its amplitude spectrum and phase spectrum; there are many methods for spectrum analysis, which can be selected as needed.
  • Feature extraction of the input speech further analyzes the speech. Methods for feature extraction are prior art, may be selected as needed, and are not described here.
  • the speech recognition module 30 After extracting the sample features of the input speech, the speech recognition module 30 identifies the input speech based on the sample features and the preset local database.
  • The preset local database is a database present on the terminal; even without the terminal being networked, the database can be accessed directly to obtain its information. The data saved in the local database can be understood as a secondary cache for storing sound data.
  • Deep learning mainly uses structures similar to artificial neural networks. Artificial neural networks have a hierarchical, progressive structure in which high-level representations are composed of low-level representations, building up levels from shallow to deep. The essence of deep learning is to learn important features by constructing various learning models and massive training data, so as to improve prediction accuracy.
  • various sounds are collected, and the collected sounds are extracted. The collected sounds are used as training sets, and the training set is used to improve the prediction accuracy of the model through continuous learning.
  • the training process is to optimize the model. The process of weighting. After the sample has been trained to optimize the model, the input sound is input to the model, and an output value is obtained, which is a predicted value for identifying the input sound.
• sparse coding represents a signal as a linear combination of a set of bases, requiring only a few bases for the representation.
• for example, 20 basic sound structures can be found among various disordered sounds, and other sounds can be synthesized from these 20 structures.
• the left side represents the 20 basic sound structures, and the right side represents a certain sound synthesized from them; the target synthesized sound is determined by the weight assigned to each of the 20 basic sounds during composition.
• a sample set with different pitch, timbre, and volume characteristics can thus be constructed by sparse coding; the sample set is then trained with a preset training algorithm to optimize the network weights of the speech recognition model.
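The sparse-coding idea above — a target sound expressed as a weighted combination of a few of 20 basic sound structures — can be sketched as follows. The orthonormal bases and the greedy matching-pursuit decomposition are illustrative assumptions, not the patent's specified algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 hypothetical "basic sound structures", made orthonormal here so the
# greedy decomposition below recovers the composition weights exactly
n_bases, length = 20, 64
Q, _ = np.linalg.qr(rng.standard_normal((length, n_bases)))
bases = Q.T                                   # rows: unit-norm, mutually orthogonal

# A target sound synthesized from only 3 of the 20 bases
true_weights = np.zeros(n_bases)
true_weights[[2, 7, 15]] = [1.5, -0.8, 0.6]
target = true_weights @ bases

def matching_pursuit(signal, bases, n_iter=3):
    """Greedy sparse coding: at each step pick the basis that best
    explains the residual and record its weight."""
    residual = signal.copy()
    weights = np.zeros(len(bases))
    for _ in range(n_iter):
        corr = bases @ residual               # correlation with every basis
        k = int(np.argmax(np.abs(corr)))      # most explanatory basis
        weights[k] += corr[k]
        residual = residual - corr[k] * bases[k]
    return weights

w = matching_pursuit(target, bases)
print(np.flatnonzero(np.abs(w) > 1e-9))       # indices of the bases actually used
```

Only the three bases used in synthesis come back with nonzero weight, matching the idea that "only a few bases are needed to be represented".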
• as for the preset training algorithm, there are many commonly used deep-learning-based training algorithms, which can be selected as needed.
• the set of speech recognition output values obtained from the speech recognition model refers to the set of output values produced by passing previously input speech, after feature extraction, through the speech recognition model; these output values represent recognition results for the input speech and can be retrieved by search at recognition time.
• in the embodiment of the present invention, a terminal acquires an input voice; the terminal extracts sample features of the input voice; and the terminal identifies the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and the output values obtained from it, recognition of the input speech through the local database is more accurate, achieving the aim of improving the accuracy of speech recognition during voice interaction.
  • the speech recognition module 30 includes:
  • Search sub-module 310 configured to search, in the set of speech recognition output values, whether there is a speech recognition output value corresponding to the sample feature;
• the first identification sub-module 320 is configured to acquire the speech recognition output value when the speech recognition output value corresponding to the sample feature is found;
  • the second identification sub-module 330 is configured to perform voice recognition according to the detected network signal strength of the terminal when the voice recognition output value corresponding to the sample feature is not searched.
• the speech recognition output value set contains correspondences between sample features of speech and speech recognition output values, the output values having been obtained from the speech recognition model.
• when recognition is performed, the sample feature is searched for in the set of speech recognition output values; a preset search engine is used to determine whether an output value corresponding to the sample feature exists.
• when the output value corresponding to the sample feature is found, the first identification sub-module 320 acquires it; this output value is the recognition result for the segment of input speech. The terminal then outputs the recognition result and may also perform a corresponding operation based on it. For example, if the voice interaction is meant to issue certain instructions to an intelligent robot, the robot is controlled to perform the corresponding operation; if the voice interaction is meant to retrieve some content in a browser, the corresponding retrieval is executed according to the recognized result and the retrieval result is displayed on the user terminal.
• when no output value is found, the second recognition sub-module 330 performs voice recognition according to the detected network signal strength of the terminal.
• many methods for detecting the terminal's network signal strength exist in the prior art; one can be selected as needed and is not described again here.
• the purpose of judging the terminal's network signal strength is to determine the current network environment; the next step proceeds according to its condition.
• the terminal searches the speech recognition output value set for an output value corresponding to the sample feature, and when one is found it is acquired, improving the accuracy of recognition.
• when no output value corresponding to the sample feature is found, speech recognition proceeds according to the strength of the terminal's network signal. This avoids attempting to send the sample feature to other servers, or waiting for their connection responses, when the network signal is weak, thereby improving the speed of speech recognition.
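The lookup-then-fallback flow described above might be sketched as below. All names and the threshold value are hypothetical, since the patent leaves the preset value and the data structures unspecified:

```python
FIRST_PRESET = 50   # hypothetical first preset value for network signal strength

def recognize(sample_feature, output_value_set, signal_strength,
              local_model, own_cloud):
    """Search the local output-value set first; on a miss, route on the
    terminal's network signal strength (all names are illustrative)."""
    if sample_feature in output_value_set:        # level-2 cache hit
        return output_value_set[sample_feature]
    if signal_strength > FIRST_PRESET:            # strong signal: own cloud server
        return own_cloud(sample_feature)
    return local_model(sample_feature)            # weak signal: local model predicts

# Toy stand-ins for the stored output values, the local model, and the cloud
output_values = {"feat-hello": "hello"}
print(recognize("feat-hello", output_values, 10, lambda f: "model", lambda f: "cloud"))  # hello
print(recognize("feat-new", output_values, 80, lambda f: "model", lambda f: "cloud"))    # cloud
print(recognize("feat-new", output_values, 10, lambda f: "model", lambda f: "cloud"))    # model
```

The three calls exercise the three branches: cache hit, strong-signal cloud fallback, and weak-signal local prediction.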
• FIG. 10 is a schematic diagram of the refined functional modules of the second identification sub-module 330 in the embodiment shown in FIG. 9; the system further includes an own cloud server 12 corresponding to the terminal.
  • the second identification submodule 330 includes:
  • the determining unit 331 is configured to determine whether the network signal strength is greater than a first preset value
  • the first identifying unit 332 is configured to: when the network signal strength is greater than the first preset value, send the sample feature to the own cloud server, and perform voice recognition by using the own cloud server;
  • the second identifying unit 333 is configured to input the sample feature to the voice recognition model when the network signal strength is less than the first preset value, and output the predicted recognition result.
• the functional modules provided in this embodiment perform voice recognition according to network signal strength. Specifically, the determining unit 331 determines whether the network signal strength is greater than a preset value; the preset value may be set as required and may be fixed or variable.
• when the network signal strength is greater than the first preset value, the first identifying unit 332 sends the sample feature to the terminal's own cloud server, which performs the voice recognition.
• the own cloud server refers to the terminal's network-side cloud server; the data held on it can be understood as a level-1 cache.
• when the network signal strength is less than the first preset value, the second identifying unit 333 obtains a predicted recognition result from the speech recognition model in the local database. It can be understood that if the model's predictions include both a predicted value and a confidence, the result with the highest confidence output by the model is confirmed as the speech recognition result.
• the terminal determines whether the network signal strength is greater than the first preset value; if so, it sends the sample feature to its own cloud server and searches there. When the network signal is poor, it relies on the prediction output by the speech recognition model, avoiding the corresponding delay while preserving recognition accuracy. Combined with searching the local database first and the own cloud server second, this also improves the speed of speech recognition.
  • the self-owned cloud server 12 includes:
• the searching module 201 is configured to perform a search in the own cloud server after the sample feature is received by the own cloud server;
  • the identification module 202 is configured to acquire the recognition result when the identification result corresponding to the sample feature is searched in the own cloud server;
• the sending module 203 is configured to, when no recognition result corresponding to the sample feature is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample feature to a third-party voice server, which recognizes the input voice.
• after the own cloud server receives the sample feature, a search is performed within it; when a recognition result corresponding to the sample feature is found, the recognition module acquires that result.
• the own cloud server may also store a deep-learning-based speech recognition model more complex than the one in the local database, together with the output values obtained from that model; being deployed in the cloud, it usually has multiple distributed cache servers and more computing power.
• the speech recognition output values stored in the local database can be those used most frequently, with slightly less frequently used values stored in the own cloud server. Understandably, the data in both stores is continuously updated with use, making the speech recognition process more accurate and faster.
• after recognition, the recognition result can be saved to the own cloud server and/or the local database, and deep learning is performed on it, so that the predictions of the speech recognition model become more accurate as the number of uses grows.
• when no recognition result corresponding to the sample feature is found and the network strength is greater than the second preset value, the own cloud server sends the sample feature to the third-party voice server.
• the second preset value may be set as needed; it may be the same as or different from the first preset value, because the network speeds required to access the own cloud server and the third-party server may differ.
• the third-party voice server is a server with stronger voice recognition capability.
• the third-party voice server may be a server provided by a vendor offering voice recognition services, such as a voice recognition cloud server provided by the University of Science and Technology.
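A minimal sketch of the own-cloud tier described above — level-1 cache lookup, then the third-party voice server when the network strength clears the second preset value — with all names and values assumed for illustration:

```python
SECOND_PRESET = 60   # hypothetical second preset value (may differ from the first)

def cloud_recognize(sample_feature, cloud_cache, signal_strength, third_party):
    """Own-cloud (level-1 cache) lookup, then the third-party voice server
    when the signal clears the second preset value; otherwise no result,
    so the caller never waits on a weak link."""
    if sample_feature in cloud_cache:             # hit in the own cloud server
        return cloud_cache[sample_feature]
    if signal_strength > SECOND_PRESET:
        result = third_party(sample_feature)      # stronger external recognizer
        cloud_cache[sample_feature] = result      # save back for future lookups
        return result
    return None

cache = {}
first = cloud_recognize("feat-play", cache, 90, lambda f: "play music")
second = cloud_recognize("feat-play", cache, 10, lambda f: "play music")  # now cached
print(first, second)
```

The save-back step mirrors the description's point that recognition results are stored so the caches keep improving with use.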
• classification analysis can also be performed according to user characteristics such as age and identity, and a database established accordingly, to make recognition results more accurate.
  • the architecture of the own cloud server is as shown in Figure 7.
• deploying CDN servers reduces the difference in access speed between regions.
• the CDN server is also responsible for returning data found in its cache; users access the CDN server first.
• after reaching the reverse proxy server, requests are sent to the application servers through the load-balancing server.
• the load-balancing server accommodates concurrent access by large numbers of users, implements data offloading, and improves stability. A local cache can also be added on the application server to respond quickly with recognition results based on historical identifications.
• a database server can also be set up to store the accounts and settings of a large number of users.
• an interface connects to the third-party voice recognition server, combining that server's recognition capability to improve the accuracy of recognition and the user experience.
• in summary, after the own cloud server receives the sample feature, a search is performed within it; when a recognition result corresponding to the sample feature is found, it is acquired. When no result is found, the own cloud server sends the sample feature to the third-party voice server, and recognition by that server improves the accuracy of speech recognition. Sample features are sent to the third-party voice server only when the network environment is good, which reduces response delay during recognition.
  • the terminal 11 in the voice recognition system of the present invention further includes:
• a comparison analysis module, configured to compare the sample features with a preset sample library;
• a first triggering module, configured to trigger the search sub-module to search the local database according to the sample feature when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value;
• the first identifying unit, further configured to send the sample feature to the own cloud server, which performs voice recognition, when the similarity between the sample feature and the preset sample features in the sample library is lower than the preset value.
• the comparison analysis module compares the sample features with the preset sample library; the purpose is to determine whether to search directly in the local database or to send the sample feature directly to the own cloud server.
• the preset sample library can be configured as needed. Specifically, the sample features are matched against the sample features in the library; "preset samples" refers to the samples preset in the sample library.
• when the similarity is higher than the preset value, the first trigger module triggers the search sub-module 310 to identify the input voice according to the sample feature and the local database preset in the terminal; here a matching preset sample is a sample in the library whose similarity to the sample feature is greater than the preset value.
• when the similarity between the sample feature and every sample feature in the library is lower than the preset value, the terminal sends the sample feature to its own cloud server, which performs the voice recognition; a match below the preset value means the sample feature matches no entry in the sample library to the required degree.
• the specific preset value may be set as needed, for example 80%. When the similarity between the sample feature and a preset sample feature in the library is higher than 80%, the search sub-module 310 is triggered to identify the input voice according to the sample feature and the local database preset in the terminal; otherwise, the terminal sends the sample feature to its own cloud server, which performs the speech recognition.
• in this embodiment, the sample features are compared with a preset sample library. When the similarity between the sample feature and a preset sample feature is higher than the preset value, the search is performed in the local database according to the sample feature; when the similarity with every sample feature in the library is lower than the preset value, the sample feature is sent directly to the own cloud server for matching. This improves the speed of speech recognition while ensuring its accuracy.
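The similarity pre-check described above can be sketched as follows. Cosine similarity and the 80 % threshold are used for illustration only (the description names 80 % as an example but does not fix the similarity measure):

```python
import numpy as np

SIMILARITY_PRESET = 0.8   # e.g. 80 %, as in the description above

def cosine_similarity(a, b):
    """Similarity of two feature vectors in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(sample_feature, sample_library, search_local, search_cloud):
    """Compare the feature against every preset sample; search the local
    database only when some sample is similar enough, otherwise go
    straight to the own cloud server."""
    best = max(cosine_similarity(sample_feature, s) for s in sample_library)
    if best > SIMILARITY_PRESET:
        return search_local(sample_feature)
    return search_cloud(sample_feature)

library = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(route(np.array([0.9, 0.1]), library, lambda f: "local", lambda f: "cloud"))
print(route(np.array([0.7, 0.7]), library, lambda f: "local", lambda f: "cloud"))
```

The first feature closely resembles a library sample and is routed locally; the second resembles no single sample strongly enough and goes to the cloud.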

Abstract

A voice recognition method and system, the method comprising: a terminal obtains an inputted voice (S10); the terminal extracts sample characteristics of the inputted voice (S20); the terminal recognizes the inputted voice according to the sample characteristics and a preset local database, the local database comprising a deep learning-based voice recognition model and a voice recognition output value set obtained according to the voice recognition model (S30). The accuracy of voice recognition during a voice interaction process is improved.

Description

Method and system for speech recognition

Technical field

The present invention relates to the field of voice interaction, and in particular to a method and system for voice recognition.

Background

With the development of Internet information technology, intelligent hardware such as smart TVs, smart bracelets, and intelligent robots is used more and more widely. To make information convenient to obtain, most intelligent-hardware manufacturers provide voice interaction as a human-computer interaction method. During voice interaction, the intelligent hardware acquires the voice information input by the user and then, through voice recognition, outputs corresponding information or executes corresponding instructions. When speech recognition is inaccurate, the intelligent hardware cannot output the correct information or execute the correct instructions, degrading the user experience; improving the accuracy of speech recognition during voice interaction is therefore an urgent problem to be solved.
Summary of the invention

The main object of the present invention is to provide a method and system for speech recognition, aimed at improving the accuracy of speech recognition during voice interaction.

To achieve the above object, the voice recognition method provided by the present invention includes the following steps:

the terminal acquires an input voice;

the terminal extracts sample features of the input voice;

the terminal identifies the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.

Preferably, the set of speech recognition output values contains correspondences between sample features of speech and speech recognition output values;

the terminal identifying the input voice according to the sample features and the preset local database includes:

the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample feature;

when a speech recognition output value corresponding to the sample feature is found, the terminal acquiring that output value;

when no speech recognition output value corresponding to the sample feature is found, the terminal performing voice recognition according to the detected network signal strength of the terminal.

Preferably, the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample feature includes:

the terminal determining whether the network signal strength is greater than a first preset value;

if so, the terminal sending the sample feature to the terminal's own cloud server, which performs voice recognition;

if not, the terminal inputting the sample feature into the speech recognition model and outputting the predicted recognition result.

Preferably, the method further includes:

after the own cloud server receives the sample feature, performing a search in the own cloud server;

when a recognition result corresponding to the sample feature is found in the own cloud server, the own cloud server acquiring the recognition result;

when no recognition result corresponding to the sample feature is found in the own cloud server, if the cloud server detects that the network strength is greater than a second preset value, the cloud server sending the sample feature to a third-party voice server, the input voice being recognized by the third-party voice recognition server.

Preferably, before the terminal searches the speech recognition output values for a speech recognition output value corresponding to the sample feature, the method includes:

the terminal comparing the sample features with a preset sample library;

when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, the terminal performing the step of searching the local database according to the sample feature;

when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, the terminal performing the step of sending the sample feature to the terminal's own cloud server, which performs voice recognition.
In addition, to achieve the above object, the present invention further provides a voice recognition system, the system including a terminal;

the terminal includes:

an acquisition module, configured to acquire an input voice;

a feature extraction module, configured to extract sample features of the input voice;

a voice recognition module, configured to identify the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.

Preferably, the set of speech recognition output values contains correspondences between sample features of speech and speech recognition output values;

the speech recognition module includes:

a search sub-module, configured to search the set of speech recognition output values for a speech recognition output value corresponding to the sample feature;

a first identification sub-module, configured to acquire the speech recognition output value when one corresponding to the sample feature is found;

a second identification sub-module, configured to perform voice recognition according to the detected network signal strength of the terminal when no speech recognition output value corresponding to the sample feature is found.

Preferably, the system further includes an own cloud server corresponding to the terminal;

the second identification sub-module includes:

a determining unit, configured to determine whether the network signal strength is greater than a first preset value;

a first identifying unit, configured to send the sample feature to the own cloud server, which performs voice recognition, when the network signal strength is greater than the first preset value;

a second identifying unit, configured to input the sample feature into the speech recognition model and output the predicted recognition result when the network signal strength is less than the first preset value.

Preferably, the own cloud server includes:

a search module, configured to perform a search in the own cloud server after the own cloud server receives the sample feature;

a recognition module, configured to acquire the recognition result when a recognition result corresponding to the sample feature is found in the own cloud server;

a sending module, configured to, when no recognition result corresponding to the sample feature is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample feature to a third-party voice server, the input voice being recognized by the third-party voice recognition server.

Preferably, the terminal further includes:

a comparison analysis module, configured to compare the sample features with a preset sample library;

a first triggering module, configured to trigger the search sub-module to search the local database according to the sample feature when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value;

the first identifying unit, further configured to send the sample feature to the own cloud server, which performs voice recognition, when the similarity between the sample feature and the preset sample features in the sample library is lower than the preset value.

In the embodiment of the present invention, a terminal acquires an input voice; the terminal extracts sample features of the input voice; and the terminal identifies the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and the output values obtained from it, recognition of the input speech through the local database is more accurate, achieving the aim of improving the accuracy of speech recognition during voice interaction.
Brief description of the drawings

FIG. 1 is a schematic flowchart of the steps of a first embodiment of the voice recognition method of the present invention;

FIG. 2 is a schematic diagram of obtaining a certain sound from basic sound structures in the embodiment shown in FIG. 1;

FIG. 3 is a schematic diagram of representing a target sound by sparse coding in the embodiment shown in FIG. 1;

FIG. 4 is a schematic flowchart of the refinement of step S30 in the embodiment shown in FIG. 1;

FIG. 5 is a schematic flowchart of the refinement of performing voice recognition according to the detected network signal strength of the terminal in step S330 of the embodiment shown in FIG. 4;

FIG. 6 is a schematic flowchart of the refinement of performing voice recognition in the own cloud server of the present invention;

FIG. 7 is a schematic diagram of the architecture of the own cloud server in the embodiment shown in FIG. 6;

FIG. 8 is a schematic diagram of the functional modules of a first embodiment of the voice recognition system of the present invention;

FIG. 9 is a schematic diagram of the refined functional modules of the speech recognition module 30 in the embodiment shown in FIG. 8;

FIG. 10 is a schematic diagram of the refined functional modules of the second identification sub-module 330 in the embodiment shown in FIG. 9;

FIG. 11 is a schematic diagram of the functional modules included in the own cloud server 12 of the present invention.

The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description

It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
本发明提供一种语音识别的方法。参照图1,在第一实施例中,该方法包括:The present invention provides a method of speech recognition. Referring to FIG. 1, in the first embodiment, the method includes:
步骤S10,终端获取输入语音;Step S10: The terminal acquires an input voice.
步骤S20,所述终端提取所述输入语音的样本特征;Step S20, the terminal extracts sample features of the input voice;
步骤S30,所述终端根据所述样本特征和预置的本地数据库识别所述输入语音,所述本地数据库包含基于深度学习的语音识别模型和根据所述语音识别模型得到的语音识别输出值集合。Step S30, the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a depth learning based speech recognition model and a speech recognition output value set obtained according to the speech recognition model.
本发明提供的语音识别的方法,用于在语音交互的情况下,对输入的语音进行识别。在语音交互时,一般需要在终端通过终端的声音输入设备接收用户输入的语音,然后对接收到的声音进行处理,再将输入的声音转化为文字输出,或者是将输入的声音识别后,通过控制指令控制终端的运行。终端可以理解为接收声音输入的载体,终端可以为手机、平板、智能电视、智能空调、智能机器人等各种具备语音交互功能的设备。The method for voice recognition provided by the present invention is for identifying an input voice in the case of voice interaction. In the voice interaction, it is generally required that the terminal receives the voice input by the user through the voice input device of the terminal, and then processes the received voice, and then converts the input voice into a text output, or after the input voice is recognized, Control commands control the operation of the terminal. The terminal can be understood as a carrier for receiving sound input, and the terminal can be various devices with voice interaction functions such as a mobile phone, a tablet, a smart TV, a smart air conditioner, and an intelligent robot.
本实施例中上述输入语音是语音交互过程中,用户输入的语音。当终端获取到输入的语音后,对语音进行处理,具体的,获取到的输入语音会以声音数据的形式存在,然后将声音数据进行频谱分析,再 提取样本特征存入终端。频谱分析是指,对信号进行傅里叶变换,得到其振幅谱与相位谱,具体的频谱分析的方法有很多,可以根据需要进行选择。对输入语音进行特征提取是为了进一步将语音进行分析,具体进行特征提取的方法属于现有技术,这里不再赘述,可以根据需要进行选择提取语音的方法。In the embodiment, the input voice is a voice input by a user during a voice interaction. After the terminal obtains the input voice, the voice is processed. Specifically, the obtained input voice exists in the form of voice data, and then the voice data is analyzed by spectrum, and then The extracted sample features are stored in the terminal. Spectral analysis means that the signal is Fourier transformed to obtain its amplitude spectrum and phase spectrum. There are many methods for spectrum analysis, which can be selected according to needs. The feature extraction of the input speech is to further analyze the speech. The method for feature extraction is a prior art, and is not described here. The method for selecting the speech may be selected as needed.
After the sample features of the input speech are extracted, the input speech is recognized according to the sample features and the preset local database. The preset local database resides on the terminal and can be accessed directly, without a network connection, to obtain its information; the data stored in the local database can be understood as a second-level cache of sound data.
The local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Deep learning mainly exploits structures similar to artificial neural networks: an artificial neural network is a hierarchical system in which the levels are progressive and high-level representations are composed of combinations of low-level ones, so the hierarchy is built up from shallow to deep. The essence of deep learning is to learn important features by constructing learning models and massive amounts of training data, thereby improving the accuracy of judgments. During deep learning, various sounds are collected and their features extracted; the collected sounds serve as a training set, which is used to improve the model's prediction accuracy through continuous learning. Training is the process of optimizing the model's weights. After the model has been trained and optimized on the samples, feeding an input sound to the model yields an output value, which is the predicted recognition of that sound.
When building the speech recognition model, a sparse coding algorithm can be used. Sparse coding represents a signal as a linear combination of a set of basis elements, requiring only a few bases to express the signal. Prior research indicates that 20 basic sound structures can be identified in various unordered sounds, and other sounds can be synthesized from these 20. As shown in Fig. 2, the left side shows the 20 basic sound structures and the right side shows a sound segment synthesized from them; the target synthesized sound is determined by the weights assigned to the 20 basic sounds during synthesis. With sparse coding, the feature representation of a sound can be written as Target = SUM(a[k] * S[k]), where a[k] is the weight coefficient applied when superimposing element S[k], and S[k] is one of the basic sound structures. Fig. 3 shows a schematic of Target = SUM(a[k] * S[k]) representing a target sound by sparse coding: x is the sound at a certain time point, 0.9 is a weight coefficient a[k], and φ36 is one of the basic sound structures S[k]. Sparse coding makes it possible to construct sample sets with varying pitch, timbre, and volume characteristics; the sample set is then trained with a preset training algorithm to optimize the network weights of the speech recognition model. Many deep-learning-based training algorithms are in common use and can be selected as needed.
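The combination Target = SUM(a[k] * S[k]) can be demonstrated directly. In this sketch the 20 basic sound structures are stood in by random vectors (real bases would be learned waveforms); only the sparse linear combination itself follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 20 basic sound structures S[k]
# (in the patent these are learned basis waveforms, not noise).
n_bases, frame_len = 20, 64
S = rng.standard_normal((n_bases, frame_len))

# Sparse weight vector a[k]: only a few bases are active.
a = np.zeros(n_bases)
a[[3, 7, 12]] = [0.9, -0.4, 0.2]

# Target = SUM(a[k] * S[k]) -- the linear combination from the text.
target = a @ S
```

Because most entries of a[k] are zero, the target is expressed with only three of the twenty bases, which is the "few bases suffice" property sparse coding relies on.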
The set of speech recognition output values obtained from the speech recognition model refers to the collection of output values produced by passing a number of pre-entered speech samples, via feature extraction, through the model. These output values represent the recognition results of the corresponding inputs and can be retrieved by search during recognition.
In the embodiment of the present invention, the terminal acquires input speech; the terminal extracts sample features of the input speech; and the terminal recognizes the input speech according to the sample features and a preset local database, where the local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and its output values, recognizing input speech with this local database yields more accurate results, thereby achieving the aim of improving recognition accuracy during voice interaction.
Referring to Fig. 4, a schematic flowchart of the refinement of step S30, step S30 includes:
Step S310: the terminal searches the set of speech recognition output values for an output value corresponding to the sample feature; if found, step S320 is performed; if not, step S330 is performed.
Step S320: the terminal obtains the speech recognition output value.
Step S330: the terminal performs speech recognition according to the detected network signal strength of the terminal.
In this embodiment, because the sample features of a set of speech inputs have been passed through the speech recognition model to obtain corresponding output values, the set of speech recognition output values contains the correspondence between sample features and output values. During recognition, the sample feature is searched for in this set; specifically, a preset search engine is used to check whether an output value corresponding to the sample feature exists.
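The correspondence between sample features and output values can be pictured as a keyed lookup. A minimal sketch follows; the quantization scheme and the example feature vectors are assumptions for illustration, since the patent does not specify how features are indexed:

```python
from typing import Optional, Tuple

def feature_key(feature: Tuple[float, ...], step: float = 0.1) -> Tuple[int, ...]:
    """Quantize a feature vector so near-identical features share a key
    (a hypothetical indexing scheme, not specified in the patent)."""
    return tuple(round(x / step) for x in feature)

# The output-value set as a mapping: quantized sample feature -> result.
output_values = {
    feature_key((0.9, 0.3, 0.1)): "turn on the light",
}

def search_output_value(feature: Tuple[float, ...]) -> Optional[str]:
    """Step S310: return the stored output value, or None on a miss."""
    return output_values.get(feature_key(feature))
```

A miss (None) corresponds to the branch in which the terminal falls back to recognition based on network signal strength.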
When the speech recognition output value corresponding to the sample feature is found, that output value is obtained; it is the recognition result for the segment of input speech. The recognition result can then be output on the terminal, or a corresponding operation can be executed according to it: for example, if the voice interaction is intended to issue instructions to an intelligent robot, the robot is controlled to perform the corresponding operation; or, if the voice interaction is a content search in a browser, the corresponding retrieval is performed according to the recognized result and the search results are displayed on the user terminal.
When no speech recognition output value corresponding to the sample feature is found, speech recognition is performed according to the detected network signal strength of the terminal. Many methods for detecting the strength of the terminal's network signal exist in the prior art and can be selected as needed; they are not described here. The purpose of judging the network signal strength is to assess the current network environment, and the next step is taken according to whether that environment is good.
In this embodiment, the terminal searches the set of speech recognition output values for an output value corresponding to the sample feature; when one is found, it is obtained, which improves recognition accuracy. At the same time, when no corresponding output value is found, speech recognition is performed according to the strength of the terminal's network signal, avoiding attempts to send the sample feature to other servers, or waiting for other servers to return a connection response, when the network signal is weak; this improves the speed of speech recognition.
Referring to Fig. 5, a schematic flowchart of the refinement of performing speech recognition according to the detected network signal strength of the terminal in step S330, step S330 further includes:
Step S331: the terminal judges whether the network signal strength is greater than a first preset value; if so, step S332 is performed; if not, step S333 is performed.
Step S332: the terminal sends the sample feature to the terminal's own cloud server, and speech recognition is performed through that server.
Step S333: the terminal inputs the sample feature into the speech recognition model and outputs the predicted recognition result.
This embodiment refines the step in which the terminal performs speech recognition according to network signal strength. Specifically, the terminal judges whether the network signal strength is greater than a preset value; the preset value can be set as needed and may be either fixed or variable.
When the network signal strength is greater than the first preset value, the terminal's network environment is good, so the sample feature is sent to the terminal's own cloud server, which performs the recognition. The own cloud server is the terminal's network-side cloud server, and the data it holds can be understood as a first-level cache.
When the network signal strength is not greater than the first preset value, the network environment may be poor and sending the sample feature to the own cloud server might fail; therefore, the predicted recognition result is obtained directly from the speech recognition model in the local database. It can be understood that, if the model's output includes both predicted values and confidences, the result with the highest confidence output by the model can be taken as the speech recognition result at output time.
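Steps S331 to S333 amount to a small decision routine. In the sketch below, the numeric threshold and the local model's interface are illustrative assumptions, not values given in the patent:

```python
from typing import List, Tuple

FIRST_PRESET = 0.6  # hypothetical first preset value (normalized strength)

def local_predict(feature) -> List[Tuple[str, float]]:
    """Stand-in for the local deep-learning model: returns
    (predicted result, confidence) pairs."""
    return [("play music", 0.85), ("pause music", 0.10)]

def recognize(feature, signal_strength: float) -> str:
    if signal_strength > FIRST_PRESET:
        # Step S332: good network -- defer to the own cloud server.
        return "sent to own cloud server"
    # Step S333: weak network -- take the highest-confidence local prediction.
    return max(local_predict(feature), key=lambda rc: rc[1])[0]
```

The max-by-confidence selection in the weak-network branch mirrors the remark above that the highest-confidence model output is taken as the recognition result.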
In this embodiment, the terminal judges whether the network signal strength is greater than the first preset value; if so, it sends the sample feature to its own cloud server, where the search is performed. When the network signal is poor, the predicted recognition result is output by the speech recognition model, avoiding the corresponding delay while improving recognition accuracy. Combining a search in the local database first with a subsequent search on the own cloud server also improves the speed of speech recognition.
Referring to Fig. 6, a schematic flowchart of the refinement of performing speech recognition in the own cloud server; in this embodiment, the speech recognition method of the present invention further includes:
Step S101: after the own cloud server receives the sample feature, a search is performed in the own cloud server.
Step S102: when the recognition result corresponding to the sample feature is found in the own cloud server, the own cloud server obtains the recognition result.
Step S103: when the recognition result corresponding to the sample feature is not found in the own cloud server, and the cloud server detects that the network strength is greater than a second preset value, the cloud server sends the sample feature to a third-party speech server, and the input speech is recognized through that third-party speech recognition server.
This embodiment mainly describes the process of performing speech recognition in the own cloud server.
After the terminal sends the sample feature to the own cloud server, a search is performed there; when the recognition result corresponding to the sample feature is found, it is obtained.
The own cloud server can also store a more complex deep-learning-based speech recognition model than the local database, along with the output values obtained from that model, because the own cloud server is deployed in the cloud, usually with multiple distributed cache servers and stronger computing power. Meanwhile, the speech recognition outputs stored in the local database can be those used most frequently, with slightly less frequently used ones stored on the own cloud server. It can be understood that the data stored in the local database and on the own cloud server are continuously updated with use, making recognition more accurate and faster. Likewise, when a recognition result is obtained, it can be saved to the own cloud server and/or the local database, and deep learning can be performed on it, so that the model's predictions become more precise as usage increases.
When the recognition result corresponding to the sample feature is not found in the own cloud server, and the detected network strength is greater than the second preset value (that is, the terminal's network strength exceeds the second preset value), the cloud server sends the sample feature to a third-party speech server. The second preset value can be set as needed; it may be equal to or different from the first preset value, because the network speeds required to access the own cloud server and a third-party cloud server may differ. The third-party speech server is a server with stronger speech recognition capability, typically one provided by a vendor specializing in speech recognition services, such as the speech recognition cloud server provided by iFLYTEK.
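Steps S101 to S103 form a chain of fallbacks on the cloud side. In this sketch, the threshold value and the lookup callables are illustrative placeholders; only the control flow follows the text:

```python
from typing import Callable, Optional

SECOND_PRESET = 0.4  # hypothetical second preset value

def own_cloud_recognize(
    feature: object,
    cloud_lookup: Callable[[object], Optional[str]],
    third_party: Callable[[object], str],
    signal_strength: float,
) -> Optional[str]:
    """Return the own-cloud result when present; escalate to the
    third-party server only when the network strength allows it."""
    result = cloud_lookup(feature)       # S101: search the own cloud server
    if result is not None:
        return result                    # S102: cache hit
    if signal_strength > SECOND_PRESET:
        return third_party(feature)      # S103: escalate to the third party
    return None                          # no result available offline
```

Gating the third-party call on signal strength is what keeps the slow path from being attempted when the network is poor.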
It can also be understood that classification analysis can be performed according to characteristics of the user, such as age and identity, and a database built accordingly, making the recognition results more accurate.
In implementation, the architecture of the own cloud server is as shown in Fig. 7. CDN servers are deployed to reduce differences in access speed across regions, and the CDN servers are also responsible for returning data found in the cache. A user's request passes through a CDN server to a reverse proxy server and then, through a load-balancing server, to an application server; the load balancer can handle concurrent access by a large number of users, distributing the data flow and improving stability. A local cache can also be added on the application server to respond quickly to recognition results based on history. During voice interaction, a search engine works together with a non-relational database, and a database server can be set up to store the accounts and settings of a large number of users. At the same time, the system interfaces upward with the third-party speech recognition server, whose recognition capability is combined to improve accuracy and the user experience.
In this embodiment, after the own cloud server receives the sample feature, a search is performed in the own cloud server; when the recognition result corresponding to the sample feature is found there, it is obtained. When no recognition result is found and the terminal's network strength is greater than the second preset value, the cloud server sends the sample feature to a third-party speech server, improving recognition accuracy by performing recognition there. Moreover, the sample feature is sent to the third-party speech server only when the network environment is good, which helps avoid response delays in the recognition process.
In this embodiment, before step S310, the method includes:
the terminal compares the sample feature with a preset sample library;
when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, step S310 is performed;
when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, step S332 is performed.
In this embodiment, after the input speech is acquired and its sample features are extracted, the sample features are compared with the preset sample library. The purpose is to decide whether to search directly in the local database or to send the sample features directly to the own cloud server. The preset sample library can be configured in advance as needed. Specifically, the sample features are matched against the sample features in the library; the preset samples refer to the samples preset in the library.
When the similarity between the sample feature and a preset sample feature in the library is higher than the preset value, step S310 is performed, that is, the terminal recognizes the input speech according to the sample feature and the local database preset on the terminal; here, the preset sample refers to a sample in the library whose similarity to the sample feature is greater than the preset value. When the similarity between the sample feature and every preset sample feature in the library is lower than the preset value, step S332 is performed, that is, the terminal sends the sample feature to its own cloud server, which performs the recognition; here, "every sample feature match below the preset value" means that the degree of match between the sample feature and every sample in the library is below the preset value.
The specific preset value can be set as needed, for example 80%: when the similarity between the sample feature and a preset sample feature in the library is higher than 80%, step S310 is performed; when the similarity with every preset sample feature in the library is lower than 80%, step S332 is performed.
In this embodiment, the sample feature is compared with the preset sample library: when its similarity to a preset sample feature in the library is higher than the preset value, the search is performed in the local database according to the sample feature; when its similarity to every sample feature in the library is lower than the preset value, the sample feature is sent directly to the own cloud server for matching. This improves the speed of speech recognition while ensuring its accuracy.
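The similarity-gated routing described above can be sketched with cosine similarity. The similarity measure, the feature vectors, and the library contents are assumptions (the patent specifies only a similarity threshold, exemplified as 80%):

```python
import math
from typing import List, Sequence

SIM_THRESHOLD = 0.8  # the 80% example value from the text

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def route(feature: Sequence[float], library: List[Sequence[float]]) -> str:
    """Return 'S310' (search the local database) when any library sample
    is similar enough, otherwise 'S332' (send to the own cloud server)."""
    if any(cosine(feature, sample) > SIM_THRESHOLD for sample in library):
        return "S310"
    return "S332"
```

Only one library sample needs to clear the threshold to stay local; when all fall below it, the feature goes straight to the own cloud server.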
The present invention further provides a speech recognition system. Referring to Fig. 8, a first embodiment of the speech recognition system of the present invention is provided. In this embodiment, the system includes a terminal 11, and the terminal includes:
an acquisition module 10, configured to acquire input speech;
a feature extraction module 20, configured to extract sample features of the input speech;
a speech recognition module 30, configured to recognize the input speech according to the sample features and a preset local database, where the local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model.
The speech recognition system provided by the present invention is used to recognize input speech during voice interaction. In voice interaction, the terminal generally receives the user's speech through its sound input device, processes the received sound, and then either converts it into text output or, after recognizing it, controls the operation of the terminal through control instructions. Here the speech recognition system includes a terminal, which can be understood as any carrier that receives sound input, such as a mobile phone, tablet, smart TV, smart air conditioner, intelligent robot, or other device with a voice interaction function.
In this embodiment, the input speech is the speech entered by the user during voice interaction. After the acquisition module 10 obtains the input speech, the feature extraction module 20 processes the speech: the input speech exists in the form of sound data, which is subjected to spectral analysis, after which sample features are extracted and stored on the terminal. Spectral analysis refers to applying a Fourier transform to the signal to obtain its amplitude spectrum and phase spectrum; many spectral analysis methods exist and can be selected as needed. Feature extraction is performed on the input speech for further analysis; specific feature extraction methods belong to the prior art and are not described here, and a suitable method can be selected as required.
After the sample features of the input speech are extracted, the speech recognition module 30 recognizes the input speech according to the sample features and the preset local database. The preset local database resides on the terminal and can be accessed directly, without a network connection, to obtain its information; the data stored in the local database can be understood as a second-level cache of sound data.
The local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Deep learning mainly exploits structures similar to artificial neural networks: an artificial neural network is a hierarchical system in which the levels are progressive and high-level representations are composed of combinations of low-level ones, so the hierarchy is built up from shallow to deep. The essence of deep learning is to learn important features by constructing learning models and massive amounts of training data, thereby improving the accuracy of judgments. During deep learning, various sounds are collected and their features extracted; the collected sounds serve as a training set, which is used to improve the model's prediction accuracy through continuous learning. Training is the process of optimizing the model's weights. After the model has been trained and optimized on the samples, feeding an input sound to the model yields an output value, which is the predicted recognition of that sound.
When building the speech recognition model, a sparse coding algorithm can be used. Sparse coding represents a signal as a linear combination of a set of basis elements, requiring only a few bases to express the signal. Prior research indicates that 20 basic sound structures can be identified in various unordered sounds, and other sounds can be synthesized from these 20. As shown in Fig. 2, the left side shows the 20 basic sound structures and the right side shows a sound segment synthesized from them; the target synthesized sound is determined by the weights assigned to the 20 basic sounds during synthesis. With sparse coding, the feature representation of a sound can be written as Target = SUM(a[k] * S[k]), where a[k] is the weight coefficient applied when superimposing element S[k], and S[k] is one of the basic sound structures. Fig. 3 shows a schematic of Target = SUM(a[k] * S[k]) representing a target sound by sparse coding: x is the sound at a certain time point, 0.9 is a weight coefficient a[k], and φ36 is one of the basic sound structures S[k]. Sparse coding makes it possible to construct sample sets with varying pitch, timbre, and volume characteristics; the sample set is then trained with a preset training algorithm to optimize the network weights of the speech recognition model. Many deep-learning-based training algorithms are in common use and can be selected as needed.
The set of speech recognition output values obtained from the speech recognition model refers to the collection of output values produced by passing a number of pre-entered speech samples, via feature extraction, through the model. These output values represent the recognition results of the corresponding inputs and can be retrieved by search during recognition.
In the embodiment of the present invention, the terminal acquires input speech; the terminal extracts sample features of the input speech; and the terminal recognizes the input speech according to the sample features and a preset local database, where the local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and its output values, recognizing input speech with this local database yields more accurate results, thereby achieving the aim of improving recognition accuracy during voice interaction.
Referring to Fig. 9, a schematic diagram of the refined functional modules of the speech recognition module 30 in the embodiment shown in Fig. 8, the speech recognition module 30 includes:
a search submodule 310, configured to search the set of speech recognition output values for an output value corresponding to the sample feature;
a first recognition submodule 320, configured to obtain the speech recognition output value when an output value corresponding to the sample feature is found;
a second recognition submodule 330, configured to perform speech recognition according to the detected network signal strength of the terminal when no output value corresponding to the sample feature is found.
In this embodiment, because the sample features of a set of speech samples are passed through the speech recognition model to obtain corresponding output values, the set of speech recognition output values contains the correspondence between sample features and output values. During recognition, the sample features are searched for in the set of speech recognition output values; specifically, a preset search engine is used to search for a speech recognition output value corresponding to the sample features.
When the search sub-module 310 finds the speech recognition output value corresponding to the sample features, the first recognition sub-module 320 acquires that value, which is the recognition result of the input speech segment. The terminal may then output the recognition result, or perform a corresponding operation according to it. For example, if the voice interaction controls an intelligent robot with certain instructions, the robot is controlled to perform the corresponding operation; if the voice interaction retrieves content in a browser, the corresponding retrieval is performed according to the recognized result and the retrieval results are displayed on the user terminal.
When no speech recognition output value corresponding to the sample features is found, the second recognition sub-module 330 performs speech recognition according to the detected network signal strength of the terminal. Many methods of detecting the network signal strength of a terminal exist in the prior art and may be selected as needed; they are not described again here. The purpose of evaluating the network signal strength is to judge the current network environment, and the next operation is chosen according to whether that environment is good.
In this embodiment, the terminal searches the set of speech recognition output values for a value corresponding to the sample features; when such a value is found, it is acquired, which improves recognition accuracy. When no such value is found, speech recognition is performed according to the strength of the terminal's network signal, which avoids attempting to send the sample features to another server, or waiting for another server to respond to a connection request, when the signal is weak, thereby improving recognition speed.
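The local-first flow of sub-modules 310, 320, and 330 can be sketched as below. The lookup table, threshold value, and helper names are illustrative assumptions, not the patent's implementation.

```python
FIRST_PRESET = 0.6  # hypothetical threshold on normalized signal strength

def recognize(sample_feature, output_values, signal_strength, local_model, cloud):
    """Local-first recognition: search the output-value set, then fall back
    according to network signal strength (sub-modules 310/320/330)."""
    # Search sub-module 310: look up the feature in the output-value set.
    result = output_values.get(sample_feature)
    if result is not None:
        return result                      # first recognition sub-module 320
    # Second recognition sub-module 330: decide by network signal strength.
    if signal_strength > FIRST_PRESET:
        return cloud(sample_feature)       # send to own cloud server
    return local_model(sample_feature)     # predict with the local model

# Toy stand-ins for the local model and the cloud service.
table = {"feat-hello": "hello"}
local = lambda f: f"local-guess({f})"
cloud = lambda f: f"cloud-result({f})"

print(recognize("feat-hello", table, 0.2, local, cloud))  # found locally
print(recognize("feat-bye", table, 0.9, local, cloud))    # strong signal -> cloud
print(recognize("feat-bye", table, 0.3, local, cloud))    # weak signal -> local model
```

Note how the network is consulted only after the local search misses, which is exactly what keeps the weak-signal path free of server round-trips.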
Referring to FIG. 10, which is a schematic diagram of the refined functional modules of the second recognition sub-module 330 in the embodiment shown in FIG. 9, the system further includes an own cloud server 12 corresponding to the terminal;
the second recognition sub-module 330 includes:
a determining unit 331, configured to determine whether the network signal strength is greater than a first preset value;
a first recognition unit 332, configured to send the sample features to the own cloud server, which performs speech recognition, when the network signal strength is greater than the first preset value;
a second recognition unit 333, configured to input the sample features into the speech recognition model and output a predicted recognition result when the network signal strength is less than the first preset value.
The functional modules provided in this embodiment perform speech recognition according to the network signal strength. Specifically, the determining unit 331 determines whether the network signal strength is greater than a preset value; the preset value may be set as needed and may be a fixed value or a variable one.
When the network signal strength is greater than the first preset value, the network environment of the terminal is good, and the first recognition unit 332 sends the sample features to the terminal's own cloud server, which performs the speech recognition. The own cloud server refers to the terminal's network-side cloud server, and the data stored in it can be understood as a first-level cache.
When the network signal strength is not greater than the first preset value, the network environment may be poor and sending the sample features to the own cloud server may fail. Therefore, the second recognition unit 333 obtains a predicted recognition result from the speech recognition model in the local database. It can be understood that, if the prediction output by the speech recognition model includes a predicted value and a confidence, the result with the highest confidence can be taken as the speech recognition result at output time.
In this embodiment, the terminal determines whether the network signal strength is greater than the first preset value; if so, it sends the sample features to its own cloud server for retrieval there, and when the network signal is poor, it outputs the predicted recognition result of the speech recognition model, avoiding the corresponding delay while improving recognition accuracy over the network. Searching the local database first and then the own cloud server also improves the speed of speech recognition.
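When the local model returns (predicted value, confidence) pairs, the highest-confidence selection performed by the second recognition unit 333 can be sketched as follows; the list of pairs and its values are hypothetical.

```python
def pick_prediction(predictions):
    """Return the (value, confidence) pair with the highest confidence.

    `predictions` is a hypothetical list of (value, confidence) pairs
    as might be produced by the local speech recognition model.
    """
    if not predictions:
        raise ValueError("model returned no predictions")
    # max over confidence, the second element of each pair
    return max(predictions, key=lambda p: p[1])

preds = [("hello", 0.72), ("hollow", 0.18), ("halo", 0.10)]
best, conf = pick_prediction(preds)
print(best, conf)  # hello 0.72
```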
Referring to FIG. 11, which is a schematic diagram of the functional modules included in the own cloud server 12, in this embodiment the own cloud server 12 includes:
a search module 201, configured to search in the own cloud server after the own cloud server receives the sample features;
a recognition module 202, configured to acquire the recognition result when a recognition result corresponding to the sample features is found in the own cloud server;
a sending module 203, configured to, when no recognition result corresponding to the sample features is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample features to a third-party speech server, the third-party speech recognition server recognizing the input speech.
In this embodiment, after the terminal sends the sample features to the own cloud server, a search is performed there; when a recognition result corresponding to the sample features is found, the recognition module 202 acquires it.
The own cloud server may also store a deep-learning-based speech recognition model more complex than the one in the local database, together with the output values obtained from that model, because the own cloud server is deployed in the cloud, usually with multiple distributed cache servers and stronger computing power. Meanwhile, the speech recognition output results stored in the local database can be those used most frequently, with less frequently used results stored in the own cloud server. It can be understood that the data stored in the local database and in the own cloud server is continuously updated with use, making the speech recognition process more accurate and faster. At the same time, when a speech recognition result is obtained, it can be saved to the own cloud server and/or the local database, and deep learning can be performed on it, so that the predictions of the speech recognition model become more accurate as the number of uses increases.
When no recognition result corresponding to the sample features is found in the own cloud server, and the detected network strength, that is, the terminal's network strength, is greater than a second preset value, the cloud server sends the sample features to a third-party speech server. The second preset value may be set as needed; it may be the same as the first preset value or different from it, because the network speeds required to access the own cloud server and a third-party cloud server may differ. The third-party speech server is a server with stronger speech recognition capability, typically one provided by a vendor specializing in speech recognition services, such as the speech recognition cloud server provided by iFLYTEK.
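The own-cloud stage described for modules 201–203 can be sketched as a second tier of the same cache-then-escalate pattern. The cache contents, threshold, and third-party stub here are illustrative assumptions.

```python
SECOND_PRESET = 0.8  # hypothetical; may differ from the first preset value

def cloud_recognize(sample_feature, cloud_cache, signal_strength, third_party):
    """Own-cloud stage: search the cloud cache, then escalate to a
    third-party speech server only when the signal clears SECOND_PRESET."""
    result = cloud_cache.get(sample_feature)   # search module 201
    if result is not None:
        return result                          # recognition module 202
    if signal_strength > SECOND_PRESET:        # sending module 203
        return third_party(sample_feature)
    return None  # caller falls back to the local model's prediction

cache = {"feat-weather": "what's the weather"}
third = lambda f: f"third-party({f})"

print(cloud_recognize("feat-weather", cache, 0.5, third))  # cloud cache hit
print(cloud_recognize("feat-new", cache, 0.9, third))      # escalate to third party
print(cloud_recognize("feat-new", cache, 0.5, third))      # None: weak signal
```

Using a separate, possibly stricter threshold for the third-party hop reflects the description's point that reaching an external server may demand a better network than reaching the own cloud.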
It can be understood that classification analysis may also be performed according to user characteristics such as age and identity, and a database established accordingly, making recognition results more accurate.
In implementation, the architecture of the own cloud server is as shown in FIG. 7. A CDN server is deployed to reduce differences in access speed between regions, and the CDN server also returns data found in its cache. User requests pass through the CDN server to a reverse proxy server, and then through a load-balancing server to the application servers; the load balancer accommodates concurrent access by a large number of users, distributing traffic and improving stability. A local cache may also be added on the application servers to respond quickly with recognition results based on recognition history. Voice interaction is completed by a search engine working with a non-relational database, and a database server may be set up to store the accounts and settings of a large number of users. The system also interfaces upward with a third-party speech recognition server, combining that server's recognition capability to improve recognition accuracy and user experience.
In this embodiment, after the own cloud server receives the sample features, a search is performed in the own cloud server; when a recognition result corresponding to the sample features is found there, it is acquired. When no recognition result is found and the terminal's network strength is greater than the second preset value, the cloud server sends the sample features to a third-party speech server, and recognition by the third-party speech server improves accuracy. Moreover, sending the sample features to the third-party speech server only when the network environment is good avoids response delays in the speech recognition process.
In this embodiment, the terminal 11 in the speech recognition system proposed by the present invention further includes:
a comparison and analysis module, configured to compare the sample features with a preset sample library;
a first trigger module, configured to trigger the search sub-module to search the local database according to the sample features when the similarity between the sample features and a preset sample feature in the sample library is higher than a preset value;
the first recognition unit being further configured to send the sample features to the own cloud server, which performs speech recognition, when the similarity between the sample features and the preset sample features in the sample library is lower than the preset value.
In this embodiment, after the input speech is acquired and its sample features are extracted, the comparison and analysis module compares the sample features with a preset sample library; the purpose is to decide whether to search directly in the local database or to send the sample features directly to the own cloud server for searching. The preset sample library may be set in advance as needed. Specifically, the sample features are matched against the sample features in the sample library; the preset samples are the samples preset in the sample library.
When the similarity between the sample features and a preset sample feature in the sample library is higher than the preset value, the first trigger module triggers the search sub-module 310 to recognize the input speech according to the sample features and the local database preset in the terminal; here, a preset sample is a sample in the sample library whose similarity to the sample features exceeds the preset value. When the similarity between the sample features and every preset sample feature in the sample library is lower than the preset value, that is, when the match between the sample features and every sample in the library falls below the preset value, the terminal sends the sample features to its own cloud server, which performs the speech recognition.
The specific preset value may be set as needed, for example to 80%: when the similarity between the sample features and a preset sample feature in the sample library is higher than 80%, the search sub-module 310 is triggered to recognize the input speech according to the sample features and the local database preset in the terminal; when the similarity is lower than 80%, the terminal sends the sample features to its own cloud server, which performs the speech recognition.
In this embodiment, the sample features are compared with a preset sample library: when the similarity to a preset sample feature is higher than the preset value, the local database is searched according to the sample features; when the similarity to every sample feature in the library is lower than the preset value, the sample features are sent directly to the own cloud server for matching, improving the speed of speech recognition while maintaining its accuracy.
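The similarity gate with the 80% example threshold can be sketched as below. The patent does not specify how similarity is computed, so cosine similarity over toy feature vectors is used here purely as an illustrative assumption.

```python
SIMILARITY_PRESET = 0.80  # the 80% example threshold from the description

def similarity(a, b):
    """Toy similarity between two feature vectors (cosine); the actual
    comparison method is not specified by the patent."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def route(sample_feature, sample_library):
    """Return 'local' if any library sample matches above the preset,
    otherwise 'cloud' (send straight to the own cloud server)."""
    best = max((similarity(sample_feature, s) for s in sample_library), default=0.0)
    return "local" if best > SIMILARITY_PRESET else "cloud"

library = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(route([0.9, 0.1, 0.0], library))   # close to a library sample -> local
print(route([0.0, 0.0, 1.0], library))   # matches no sample -> cloud
```

Only the best-matching library sample matters: one match above the threshold routes to the local database, while falling below it for every sample routes to the cloud.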
The above are only preferred embodiments of the present invention and are not intended to limit the patent scope of the present invention; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

  1. A speech recognition method, characterized in that the method comprises the following steps:
    a terminal acquires input speech;
    the terminal extracts sample features of the input speech;
    the terminal recognizes the input speech according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  2. The method according to claim 1, characterized in that the set of speech recognition output values contains a correspondence between sample features of speech and speech recognition output values;
    the terminal recognizing the input speech according to the sample features and the preset local database comprises:
    the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample features;
    when a speech recognition output value corresponding to the sample features is found, the terminal acquiring the speech recognition output value;
    when no speech recognition output value corresponding to the sample features is found, the terminal performing speech recognition according to the detected network signal strength of the terminal.
  3. The method according to claim 2, characterized in that the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample features comprises:
    the terminal determining whether the network signal strength is greater than a first preset value;
    if so, the terminal sending the sample features to the terminal's own cloud server, which performs speech recognition;
    if not, the terminal inputting the sample features into the speech recognition model and outputting a predicted recognition result.
  4. The method according to claim 3, characterized in that the method further comprises:
    after the own cloud server receives the sample features, performing a search in the own cloud server;
    when a recognition result corresponding to the sample features is found in the own cloud server, the own cloud server acquiring the recognition result;
    when no recognition result corresponding to the sample features is found in the own cloud server, if the cloud server detects that the network strength is greater than a second preset value, the cloud server sending the sample features to a third-party speech server, the third-party speech recognition server recognizing the input speech.
  5. The method according to claim 4, characterized in that, before the terminal searches the speech recognition output values for a speech recognition output value corresponding to the sample features, the method comprises:
    the terminal comparing the sample features with a preset sample library;
    when the similarity between the sample features and a preset sample feature in the sample library is higher than a preset value, the terminal performing the step of searching the local database according to the sample features;
    when the similarity between the sample features and every preset sample feature in the sample library is lower than the preset value, the terminal performing the step of sending the sample features to the terminal's own cloud server, which performs speech recognition.
  6. A speech recognition system, characterized in that the system comprises: a terminal;
    the terminal comprising:
    an acquisition module, configured to acquire input speech;
    a feature extraction module, configured to extract sample features of the input speech;
    a speech recognition module, configured to recognize the input speech according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  7. The system according to claim 6, characterized in that the set of speech recognition output values contains a correspondence between sample features of speech and speech recognition output values;
    the speech recognition module comprising:
    a search sub-module, configured to search the set of speech recognition output values for a speech recognition output value corresponding to the sample features;
    a first recognition sub-module, configured to acquire the speech recognition output value when a speech recognition output value corresponding to the sample features is found;
    a second recognition sub-module, configured to perform speech recognition according to the detected network signal strength of the terminal when no speech recognition output value corresponding to the sample features is found.
  8. The system according to claim 7, characterized in that the system further comprises an own cloud server corresponding to the terminal;
    the second recognition sub-module comprising:
    a determining unit, configured to determine whether the network signal strength is greater than a first preset value;
    a first recognition unit, configured to send the sample features to the own cloud server, which performs speech recognition, when the network signal strength is greater than the first preset value;
    a second recognition unit, configured to input the sample features into the speech recognition model and output a predicted recognition result when the network signal strength is less than the first preset value.
  9. The system according to claim 8, characterized in that the own cloud server comprises:
    a search module, configured to perform a search in the own cloud server after the own cloud server receives the sample features;
    a recognition module, configured to acquire the recognition result when a recognition result corresponding to the sample features is found in the own cloud server;
    a sending module, configured to, when no recognition result corresponding to the sample features is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample features to a third-party speech server, the third-party speech recognition server recognizing the input speech.
  10. The system according to claim 9, characterized in that the terminal further comprises:
    a comparison and analysis module, configured to compare the sample features with a preset sample library;
    a first trigger module, configured to trigger the search sub-module to search the local database according to the sample features when the similarity between the sample features and a preset sample feature in the sample library is higher than a preset value;
    the first recognition unit being further configured to send the sample features to the own cloud server, which performs speech recognition, when the similarity between the sample features and the preset sample features in the sample library is lower than the preset value.
PCT/CN2017/083065 2016-05-30 2017-05-04 Voice recognition method and system WO2017206661A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610370685.8A CN105931633A (en) 2016-05-30 2016-05-30 Speech recognition method and system
CN201610370685.8 2016-05-30

Publications (1)

Publication Number Publication Date
WO2017206661A1 true WO2017206661A1 (en) 2017-12-07

Family

ID=56841485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/083065 WO2017206661A1 (en) 2016-05-30 2017-05-04 Voice recognition method and system

Country Status (2)

Country Link
CN (1) CN105931633A (en)
WO (1) WO2017206661A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096620A (en) * 2021-03-24 2021-07-09 妙音音乐科技(武汉)有限公司 Musical instrument tone color identification method, system, equipment and storage medium

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN105931633A (en) * 2016-05-30 2016-09-07 深圳市鼎盛智能科技有限公司 Speech recognition method and system
CN106453859B (en) * 2016-09-23 2019-11-15 维沃移动通信有限公司 A kind of sound control method and mobile terminal
CN106898350A (en) * 2017-01-16 2017-06-27 华南理工大学 A kind of interaction of intelligent industrial robot voice and control method based on deep learning
US10574777B2 (en) * 2017-06-06 2020-02-25 International Business Machines Corporation Edge caching for cognitive applications
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN107742516B (en) * 2017-09-29 2020-11-17 上海望潮数据科技有限公司 Intelligent recognition method, robot and computer readable storage medium
CN108039174A (en) * 2018-01-08 2018-05-15 珠海格力电器股份有限公司 Speech recognition system, method and apparatus
CN109005451B (en) * 2018-06-29 2021-07-30 杭州星犀科技有限公司 Video strip splitting method based on deep learning
CN110839051B (en) * 2018-08-16 2022-07-01 科沃斯商用机器人有限公司 Service providing method, device, robot and storage medium
CN109377997B (en) * 2018-12-10 2021-06-01 珠海格力电器股份有限公司 Voice control method and device for household appliance, storage medium and household appliance system
CN109605373A (en) * 2018-12-21 2019-04-12 重庆大学 Voice interactive method based on robot
CN111625362A (en) * 2020-05-29 2020-09-04 浪潮电子信息产业股份有限公司 Computing resource scheduling method and device and related components
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (10)

Publication number Priority date Publication date Assignee Title
US20080120108A1 (en) * 2006-11-16 2008-05-22 Frank Kao-Ping Soong Multi-space distribution for pattern recognition based on mixed continuous and discrete observations
CN102708865A (en) * 2012-04-25 2012-10-03 北京车音网科技有限公司 Method, device and system for voice recognition
CN103295575A (en) * 2012-02-27 2013-09-11 北京三星通信技术研究有限公司 Speech recognition method and client
CN103488401A (en) * 2013-09-30 2014-01-01 乐视致新电子科技(天津)有限公司 Voice assistant activating method and device
CN103489444A (en) * 2013-09-30 2014-01-01 乐视致新电子科技(天津)有限公司 Speech recognition method and device
CN103956168A (en) * 2014-03-29 2014-07-30 深圳创维数字技术股份有限公司 Voice recognition method and device, and terminal
CN104575503A (en) * 2015-01-16 2015-04-29 广东美的制冷设备有限公司 Speech recognition method and device
CN104715752A (en) * 2015-04-09 2015-06-17 刘文军 Voice recognition method, voice recognition device and voice recognition system
CN105118508A (en) * 2015-09-14 2015-12-02 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN105931633A (en) * 2016-05-30 2016-09-07 深圳市鼎盛智能科技有限公司 Speech recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972253B2 (en) * 2010-09-15 2015-03-03 Microsoft Technology Licensing, Llc Deep belief network for large vocabulary continuous speech recognition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096620A (en) * 2021-03-24 2021-07-09 妙音音乐科技(武汉)有限公司 Musical instrument tone color identification method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN105931633A (en) 2016-09-07

Similar Documents

Publication Publication Date Title
WO2017206661A1 (en) Voice recognition method and system
CN111104495B (en) Information interaction method, device, equipment and storage medium based on intention recognition
US10657966B2 (en) Better resolution when referencing to concepts
JP6440732B2 (en) Automatic task classification based on machine learning
WO2020182122A1 (en) Text matching model generation method and device
US9990923B2 (en) Automated software execution using intelligent speech recognition
CN108255934B (en) Voice control method and device
WO2020087974A1 (en) Model generation method and device
CN110083693B (en) Robot dialogue reply method and device
US20160189715A1 (en) Speech recognition device and method
CN111428010B (en) Man-machine intelligent question-answering method and device
US11127399B2 (en) Method and apparatus for pushing information
CN109271533A (en) Multimedia document retrieval method
US20140207716A1 (en) Natural language processing method and system
CN110795532A (en) Voice information processing method and device, intelligent terminal and storage medium
CN113806588B (en) Method and device for searching video
CN108682415B (en) Voice search method, device and system
US20150104065A1 (en) Apparatus and method for recognizing object in image
CN108572746B (en) Method, apparatus and computer readable storage medium for locating mobile device
CN108536680B (en) Method and device for acquiring house property information
CN114239805A (en) Cross-modal retrieval neural network, training method and device, electronic equipment and medium
KR101801250B1 (en) Method and system for automatically tagging themes suited for songs
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN108694939B (en) Voice search optimization method, device and system
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17805606

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 17805606

Country of ref document: EP

Kind code of ref document: A1