WO2017206661A1 - Voice recognition method and system - Google Patents

Voice recognition method and system

Info

Publication number
WO2017206661A1
WO2017206661A1 (application PCT/CN2017/083065)
Authority
WO
WIPO (PCT)
Prior art keywords
sample feature
sample
terminal
speech recognition
cloud server
Prior art date
Application number
PCT/CN2017/083065
Other languages
French (fr)
Chinese (zh)
Inventor
许永昌
盛阁
Original Assignee
深圳市鼎盛智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市鼎盛智能科技有限公司
Publication of WO2017206661A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present invention relates to the field of voice interaction, and in particular to a method and system for voice recognition.
  • In order to let users obtain information conveniently, most intelligent hardware manufacturers provide human-computer interaction methods such as voice interaction.
  • During voice interaction, the intelligent hardware acquires the voice information input by the user and then, through voice recognition, outputs corresponding information or executes corresponding instructions.
  • When voice recognition is inaccurate, the intelligent hardware cannot output the correct information or execute the correct instructions, which degrades the user experience. Therefore, improving the accuracy of speech recognition during voice interaction is an urgent problem to be solved.
  • the main object of the present invention is to provide a method and system for speech recognition, which aims to improve the accuracy of speech recognition in a speech interaction process.
  • a method for voice recognition includes the following steps:
  • the terminal acquires an input voice
  • the terminal extracts a sample feature of the input voice
  • the terminal identifies the input speech according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the set of speech recognition output values includes a correspondence between a sample feature of the speech and a speech recognition output value
  • the identifying, by the terminal, the input voice according to the sample feature and a preset local database includes:
  • the terminal searches the set of speech recognition output values to determine whether there is a speech recognition output value corresponding to the sample feature;
  • when the speech recognition output value corresponding to the sample feature is found, the terminal acquires that speech recognition output value;
  • when the speech recognition output value corresponding to the sample feature is not found, the terminal performs voice recognition according to the detected network signal strength of the terminal.
  • the performing, by the terminal, of voice recognition according to the detected network signal strength of the terminal comprises:
  • the terminal determines whether the network signal strength is greater than a first preset value
  • when the network signal strength is greater than the first preset value, the terminal sends the sample feature to the terminal's own cloud server, and voice recognition is performed through the own cloud server;
  • when the network signal strength is not greater than the first preset value, the terminal inputs the sample feature into the speech recognition model and outputs the predicted recognition result.
  • the method further includes:
  • after the own cloud server receives the sample feature, a search is performed in the own cloud server; when a recognition result corresponding to the sample feature is found, the own cloud server acquires the recognition result;
  • when no recognition result corresponding to the sample feature is found and the own cloud server detects that the network strength is greater than a second preset value, the own cloud server sends the sample feature to a third-party voice server;
  • the input voice is then identified through the third-party voice recognition server.
  • before the terminal searches the set of speech recognition output values for a speech recognition output value corresponding to the sample feature, the method further comprises:
  • the terminal compares the sample features with a preset sample library
  • when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, the terminal performs the step of searching in the local database according to the sample feature;
  • when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, the terminal performs the step of sending the sample feature to the terminal's own cloud server and carrying out voice recognition through the own cloud server.
  • the present invention further provides a system for voice recognition, the system comprising: a terminal;
  • the terminal includes:
  • a feature extraction module configured to extract sample features of the input voice
  • a voice recognition module configured to identify the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the set of speech recognition output values includes a correspondence between a sample feature of the speech and a speech recognition output value
  • the speech recognition module includes:
  • a first identification submodule configured to acquire the speech recognition output value when the speech recognition output value corresponding to the sample feature is found;
  • a second identification submodule configured to perform voice recognition according to the detected network signal strength of the terminal when the speech recognition output value corresponding to the sample feature is not found.
  • the system further includes an own cloud server corresponding to the terminal;
  • the second identification submodule includes:
  • a determining unit configured to determine whether the network signal strength is greater than a first preset value
  • a first identifying unit configured to send the sample feature to the own cloud server when the network signal strength is greater than a first preset value, and perform voice recognition by using the own cloud server;
  • a second identifying unit configured to: when the network signal strength is less than the first preset value, input the sample feature to the voice recognition model, and output the predicted recognition result.
  • the own cloud server comprises:
  • a search module configured to perform a search in the own cloud server after receiving the sample feature by the own cloud server;
  • An identification module configured to acquire the recognition result when searching for a recognition result corresponding to the sample feature in the own cloud server
  • a sending module configured to, when no recognition result corresponding to the sample feature is found in the own cloud server and the network strength is greater than a second preset value, send the sample feature to a third-party voice server, the input voice being identified through the third-party voice recognition server.
  • the terminal further includes:
  • a comparison analysis module configured to compare the sample features with a preset sample library
  • a first triggering module configured to trigger the search submodule to search in the local database according to the sample feature when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value;
  • the first identifying unit is further configured to: when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, send the sample feature to the own cloud server, and perform voice recognition through the own cloud server.
  • In an embodiment of the present invention, the terminal acquires an input voice; the terminal extracts a sample feature of the input voice; and the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model. Because the local database includes the deep-learning-based speech recognition model and the output values obtained from it, the recognition result is more accurate when the input speech is recognized through the local database, thereby achieving the purpose of improving the accuracy of speech recognition during voice interaction.
  • FIG. 1 is a schematic flow chart of steps of a first embodiment of a method for voice recognition according to the present invention
  • FIG. 2 is a schematic diagram of obtaining a certain sound through a basic sound structure in the embodiment shown in FIG. 1;
  • FIG. 3 is a schematic diagram showing a target sound by sparse coding in the embodiment shown in FIG. 1;
  • FIG. 4 is a schematic flowchart of the refinement of step S30 in the embodiment shown in FIG. 1;
  • FIG. 5 is a schematic flowchart of a refinement step of performing voice recognition according to the detected network signal strength of the terminal in step S330 in the embodiment shown in FIG. 4;
  • FIG. 6 is a schematic flowchart of a refinement step of performing voice recognition in a self-owned cloud server according to the present invention
  • FIG. 7 is a schematic structural diagram of a self-owned cloud server in the embodiment shown in FIG. 6 according to the present invention.
  • FIG. 8 is a schematic diagram of functional modules of a first embodiment of a voice recognition system according to the present invention.
  • FIG. 9 is a schematic diagram of a refinement function module of the speech recognition module 30 in the embodiment shown in FIG. 8;
  • FIG. 10 is a schematic diagram of a refinement function module of the second identification sub-module 330 in the embodiment shown in FIG. 9;
  • FIG. 11 is a schematic diagram of functional modules included in the own cloud server 12 of the present invention.
  • the present invention provides a method of speech recognition.
  • the method includes:
  • Step S10 The terminal acquires an input voice.
  • Step S20 the terminal extracts sample features of the input voice
  • Step S30: the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the method for voice recognition provided by the present invention is for identifying an input voice in the case of voice interaction.
  • In voice interaction, the terminal generally receives the voice input by the user through its voice input device and processes the received voice, then either converts the input voice into text output or, after the input voice is recognized, controls the operation of the terminal through control commands.
  • the terminal can be understood as a carrier for receiving sound input, and the terminal can be various devices with voice interaction functions such as a mobile phone, a tablet, a smart TV, a smart air conditioner, and an intelligent robot.
  • the input voice is a voice input by a user during a voice interaction.
  • After the input voice is acquired, it is processed. Specifically, the acquired input voice exists in the form of sound data; spectrum analysis is then performed on the sound data, and the extracted sample features are stored in the terminal.
  • Spectral analysis means that the signal is Fourier transformed to obtain its amplitude spectrum and phase spectrum. There are many methods for spectrum analysis, which can be selected according to needs.
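As an illustration of the spectrum analysis described above, the sketch below computes the amplitude and phase spectrum of a short signal with a naive discrete Fourier transform. This is a minimal pure-Python sketch; a real system would use an FFT library routine, and the test signal here is invented for the example:

```python
import cmath
import math

def spectrum(signal):
    """Naive DFT: return (amplitude_spectrum, phase_spectrum) of a signal."""
    n = len(signal)
    amp, phase = [], []
    for k in range(n):
        # X[k] = sum_t x[t] * e^(-2*pi*i*k*t/n)
        xk = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                 for t in range(n))
        amp.append(abs(xk))
        phase.append(cmath.phase(xk))
    return amp, phase

# A pure tone completing one cycle over 8 samples: the amplitude spectrum
# peaks at bin 1 (and its mirror image, bin 7).
tone = [math.sin(2 * math.pi * t / 8) for t in range(8)]
amp, phase = spectrum(tone)
peak = max(range(len(amp)), key=lambda k: amp[k])
```

The amplitude spectrum concentrates the tone's energy at its frequency bin, which is the kind of feature a subsequent extraction step would work from.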
  • Feature extraction of the input speech further analyzes the speech. Methods for feature extraction are prior art, may be selected as needed, and are not described here.
  • The preset local database is a database existing on the terminal. Even when the terminal is not networked, the database can be accessed directly to obtain its information; the data stored in the local database can be understood as a secondary cache for storing sound data.
  • Deep learning mainly uses structures similar to artificial neural networks. Artificial neural networks have a hierarchical, progressive structure in which high-level representations are composed of low-level representations, building up levels from shallow to deep. The essence of deep learning is to learn important features by constructing various learning models and massive training data, so as to improve prediction accuracy.
  • Various sounds are collected and their features extracted; the collected sounds serve as the training set, which is used to improve the model's prediction accuracy through continuous learning.
  • The training process is the process of optimizing the model weights. After the model has been optimized by training on the samples, an input sound fed into the model yields an output value, which is the predicted value for identifying the input sound.
  • Sparse coding represents a signal as a linear combination of a set of bases, requiring only a few bases for the representation.
  • For example, 20 basic sound structures can be found among various disordered sounds, and other sounds can be synthesized from these 20 structures.
  • In FIG. 2, the left side represents the 20 basic sound structures, and the right side represents a certain sound synthesized from them; the target synthesized sound is determined by the weight values given to the 20 basic sounds during composition.
  • That is, the weight coefficients are a[k], and the basic sound structures are S[k]; the synthesized sound is the weighted sum of the S[k] with coefficients a[k].
  • the sample set with different pitch, timbre and volume characteristics can be constructed by sparse coding, and then the sample set is trained by the preset training algorithm to optimize the network weight of the speech recognition model.
  • Regarding the preset training algorithm: there are many commonly used deep-learning-based training algorithms, which can be selected as needed.
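The sparse-coding synthesis above (target sound = weighted sum of basic sound structures, with few non-zero weights) can be sketched as follows. The basis waveforms and weights are invented for illustration; a real system would learn S[k] from recorded audio:

```python
import random

random.seed(0)

# Hypothetical "basic sound structures" S[k]: 20 short basis waveforms.
# A real system would learn these from recorded audio.
N_BASES, LENGTH = 20, 16
S = [[random.uniform(-1.0, 1.0) for _ in range(LENGTH)] for _ in range(N_BASES)]

def synthesize(weights, bases):
    """Target sound = sum over k of a[k] * S[k] (a weighted sum of bases)."""
    length = len(bases[0])
    return [sum(a * basis[t] for a, basis in zip(weights, bases))
            for t in range(length)]

# A sparse code: only 3 of the 20 weight coefficients a[k] are non-zero.
a = [0.0] * N_BASES
a[2], a[7], a[13] = 0.8, -0.5, 0.3
target = synthesize(a, S)
```

Varying the weights over such a basis yields sample sets with different pitch, timbre, and volume characteristics, as the surrounding text describes.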
  • The set of speech recognition output values obtained according to the speech recognition model refers to the set of output values obtained by passing a number of pre-entered speech samples, after feature extraction, through the speech recognition model; these output values represent the recognition results of the input speech and can be retrieved by searching at recognition time.
  • In an embodiment of the present invention, the terminal acquires an input voice; the terminal extracts a sample feature of the input voice; and the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model. Because the local database includes the deep-learning-based speech recognition model and the output values obtained from it, the recognition result is more accurate when the input speech is recognized through the local database, thereby achieving the purpose of improving the accuracy of speech recognition during voice interaction.
  • step S30 includes:
  • Step S310: the terminal searches the set of speech recognition output values for the speech recognition output value corresponding to the sample feature; if it is found, step S320 is performed; if not, step S330 is performed.
  • Step S320: the terminal acquires the speech recognition output value.
  • Step S330: the terminal performs voice recognition according to the detected network signal strength of the terminal.
  • After the sample features of a set of speech are processed by the speech recognition model, the correspondence between the sample features and the speech recognition output values is included in the speech recognition output value set, which is then used for speech recognition.
  • The sample feature is searched for in the set of speech recognition output values; a preset search engine is used to determine whether a speech recognition output value corresponding to the sample feature exists.
  • When the speech recognition output value corresponding to the sample feature is found, that value is acquired; this speech recognition output value is the recognition result of the input voice, and the recognition result may then be output at the terminal, or a corresponding operation may be performed according to it.
  • For example, if the voice interaction is meant to issue instructions to an intelligent robot, the robot is controlled to perform the corresponding operation; if the voice interaction is meant to retrieve content in a browser, the corresponding retrieval is performed based on the recognized result, and the retrieval result is displayed on the user terminal.
  • When the speech recognition output value corresponding to the sample feature is not found, voice recognition is performed according to the detected network signal strength of the terminal.
  • Many prior-art methods exist for detecting the strength of the terminal's network signal; one may be selected as needed and is not described again here.
  • The purpose of judging the terminal's network signal strength is to determine the current network environment; the next step is chosen according to that environment.
  • In this embodiment, the terminal searches the set of speech recognition output values for the value corresponding to the sample feature, and when that value is retrieved, it is used as the result, thereby improving the accuracy of recognition.
  • When the speech recognition output value corresponding to the sample feature is not retrieved, speech recognition is performed according to the terminal's network signal strength; this avoids attempting to send the sample feature to other servers, or waiting for their responses, when the network signal is weak, thereby improving the speed of speech recognition.
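A minimal sketch of the secondary-cache lookup of steps S310 through S330: the sample-feature keys and output values below are invented for the example, since the real set would be produced by the deep-learning model rather than written by hand:

```python
# Hypothetical secondary cache: sample feature (as a hashable key) mapped to
# its speech recognition output value. Contents are illustrative only.
local_output_values = {
    ("ni", "hao"): "hello",
    ("kai", "deng"): "turn on the light",
}

def recognize_locally(sample_feature):
    """Steps S310/S320: return (output_value, True) on a cache hit, or
    (None, False) so the caller falls back to the network-strength path (S330)."""
    if sample_feature in local_output_values:
        return local_output_values[sample_feature], True
    return None, False

result, hit = recognize_locally(("kai", "deng"))
missing, found = recognize_locally(("bu", "cun", "zai"))
```

The miss case carries no result; it only signals that recognition must proceed according to network signal strength.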
  • As shown in FIG. 5, a schematic flowchart of the refinement of step S330, performing voice recognition according to the detected network signal strength of the terminal includes:
  • Step S331 the terminal determines whether the network signal strength is greater than the first preset value; if yes, step S332 is performed; if not, step S333 is performed;
  • Step S332 the terminal sends the sample feature to the own cloud server of the terminal, and performs voice recognition through the own cloud server.
  • Step S333 the terminal inputs the sample feature to the voice recognition model, and outputs the predicted recognition result.
  • This embodiment refines the terminal's voice recognition according to network signal strength. Specifically, it is determined whether the network signal strength is greater than a preset value; the preset value may be set as required and may be fixed or variable.
  • the network signal strength is greater than the first preset value, it indicates that the network environment of the terminal is good at this time.
  • the sample feature is sent to the own cloud server of the terminal, and the voice recognition is performed by the own cloud server.
  • the above-mentioned own cloud server refers to the network side cloud server of the terminal, and the data existing in the own cloud server can be understood as the level 1 cache.
  • When the network signal strength is not greater than the first preset value, the predicted recognition result is obtained directly from the speech recognition model in the local database. It can be understood that if the result output by the speech recognition model includes predicted values and confidences, the result with the highest confidence can be confirmed as the speech recognition result.
  • In this embodiment, the terminal determines whether the network signal strength is greater than the first preset value; if so, it sends the sample feature to the terminal's own cloud server and searches there, and when the network signal is poor it uses the recognition result predicted by the speech recognition model, avoiding the corresponding delay and improving recognition accuracy. Combined with searching first in the local database and then in the own cloud server, the speed of speech recognition is also improved.
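The signal-strength routing of steps S331 through S333 can be sketched as follows. The threshold value, the stand-in cloud lookup, and the toy (value, confidence) model outputs are all assumptions made for illustration:

```python
FIRST_PRESET = 50  # hypothetical threshold on a 0-100 signal-strength scale

def route_recognition(signal_strength, sample_feature, cloud_lookup, local_model):
    """Steps S331-S333: with a strong signal, recognize via the own cloud
    server; otherwise use the local deep-learning model and keep the
    highest-confidence prediction."""
    if signal_strength > FIRST_PRESET:
        return cloud_lookup(sample_feature)            # step S332
    predictions = local_model(sample_feature)          # step S333
    return max(predictions, key=lambda p: p[1])[0]

# Stand-in cloud lookup and local model, invented for the example.
cloud = lambda feature: "cloud:" + feature
model = lambda feature: [("turn on", 0.91), ("turn off", 0.06), ("tune in", 0.03)]

strong = route_recognition(80, "feat", cloud, model)  # strong signal
weak = route_recognition(20, "feat", cloud, model)    # weak signal
```

Taking the maximum over (value, confidence) pairs mirrors the text's note that the highest-confidence prediction is confirmed as the result.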
  • FIG. 6 is a schematic flowchart of the refinement steps for voice recognition in the own cloud server; in this embodiment, the method for voice recognition provided by the present invention further includes:
  • Step S101: after the sample feature is received by the own cloud server, a search is performed in the own cloud server.
  • Step S102: when the recognition result corresponding to the sample feature is found in the own cloud server, the own cloud server acquires the recognition result;
  • Step S103: when no recognition result corresponding to the sample feature is found in the own cloud server and the own cloud server detects that the network strength is greater than a second preset value, the own cloud server sends the sample feature to a third-party voice server, and the input voice is identified through the third-party voice recognition server.
  • The search is performed in the own cloud server, and when the recognition result corresponding to the sample feature is found, the recognition result is obtained.
  • the self-owned cloud server is deployed in the cloud and usually has multiple distributed cache servers with more computing power.
  • According to usage, the most frequently used speech recognition output values can be stored in the local database, while slightly less frequently used values are stored in the own cloud server. Understandably, the data stored in the local database and in the own cloud server are continuously updated with use, so that speech recognition becomes more accurate and faster.
  • After recognition, the recognition result can be saved to the own cloud server and/or the local database, and deep learning is performed on the recognition result, so that the prediction accuracy of the speech recognition model improves as the number of uses increases.
  • When no recognition result is found in the own cloud server, the own cloud server sends the sample feature to the third-party voice server.
  • The second preset value may be set as needed; it may be the same as or different from the first preset value, because the network speed required for accessing the own cloud server and for accessing the third-party server may differ.
  • the third-party voice server is a server with stronger voice recognition capability.
  • The third-party voice server can be a server provided by a vendor of voice recognition services, such as a voice recognition cloud server provided by the University of Science and Technology.
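Steps S101 through S103 on the own cloud server (search the level-1 cache, then conditionally forward to the third-party server) might look like the sketch below; the cache contents, the threshold, and the third-party stub are all hypothetical:

```python
SECOND_PRESET = 60  # hypothetical threshold for contacting the third-party server

# Level-1 cache on the own cloud server (contents invented for the example).
own_cloud_cache = {"feat-a": "play music"}

def cloud_recognize(sample_feature, network_strength, third_party):
    """Steps S101-S103: search the own cloud server first; on a miss, forward
    the sample feature to the third-party voice server only when the network
    is strong enough."""
    if sample_feature in own_cloud_cache:
        return own_cloud_cache[sample_feature]         # step S102
    if network_strength > SECOND_PRESET:
        return third_party(sample_feature)             # step S103
    return None  # weak network: avoid the third-party round trip

third_party_stub = lambda feature: "3p:" + feature
hit = cloud_recognize("feat-a", 10, third_party_stub)
forwarded = cloud_recognize("feat-b", 90, third_party_stub)
skipped = cloud_recognize("feat-b", 10, third_party_stub)
```

Skipping the third-party round trip on a weak network matches the text's point about reducing response delay.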
  • Classification analysis can also be performed according to characteristics such as the user's age and identity, and a corresponding database established, to make the recognition result more accurate.
  • the architecture of the own cloud server is as shown in Figure 7.
  • Deploying CDN servers reduces the difference in access speed between regions.
  • The CDN server is also responsible for returning data found in its cache; users access the CDN server first.
  • After a request reaches the reverse proxy server, it is sent to the application server through the load-balancing server.
  • The load-balancing server accommodates concurrent access by a large number of users, offloads data, and improves stability. A local cache can also be added on the application server to respond quickly with recognition results based on historical recognitions.
  • the database server can also be set to store the accounts and settings of a large number of users.
  • An interface connects to the third-party voice recognition server, combining the third-party server's recognition capability to improve recognition accuracy and the user experience.
  • In this embodiment, after the own cloud server receives the sample feature, a search is performed in the own cloud server; when a recognition result corresponding to the sample feature is found, the recognition result is acquired.
  • When no recognition result is found, the own cloud server sends the sample feature to the third-party voice server, and identifying it on the third-party voice server improves the accuracy of speech recognition.
  • Sending the sample features to the third-party voice server only when the network environment is good reduces the response delay in the speech recognition process.
  • Before step S310, the method includes:
  • the terminal compares the sample features with a preset sample library;
  • when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, step S310 is performed;
  • when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, step S332 is performed.
  • The sample features are compared with the preset sample library to determine whether to search directly in the local database or to send the sample features directly to the own cloud server for searching.
  • The preset sample library can be configured as needed. Specifically, the sample features are matched against the sample features in the sample library; the preset samples are the samples preset in the sample library.
  • When the similarity between the sample feature and a preset sample feature in the sample library is higher than the preset value, step S310 is performed, that is, the terminal identifies the input voice according to the sample feature and the local database preset in the terminal; here the matching preset sample is a sample in the sample library whose similarity to the sample feature is greater than the preset value.
  • When the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, step S332 is performed, that is, the terminal sends the sample feature to the terminal's own cloud server, and speech recognition is performed through the own cloud server.
  • Every sample feature matching below the preset value means that the degree of match between the sample feature and the sample library is less than the preset value.
  • The specific preset value may be set as needed; for example, it may be set to 80%. When the similarity between the sample feature and a preset sample feature in the sample library is higher than 80%, step S310 is performed; when the similarity between the sample feature and every preset sample feature in the sample library is lower than 80%, step S332 is performed.
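The 80% similarity gate described above can be sketched with a simple cosine similarity over feature vectors. The patent does not specify the similarity measure, so cosine similarity and the toy vectors below are assumptions for illustration:

```python
import math

PRESET = 0.80  # the 80% similarity threshold used in the example above

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def choose_path(feature, sample_library):
    """Return 'local' (step S310) when some library sample is similar enough,
    otherwise 'own_cloud' (step S332)."""
    best = max(cosine_similarity(feature, sample) for sample in sample_library)
    return "local" if best > PRESET else "own_cloud"

# Toy 3-dimensional features; real sample features would be acoustic vectors.
library = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near = choose_path([0.95, 0.05, 0.0], library)  # close to the first sample
far = choose_path([0.5, 0.5, 0.7], library)     # matches nothing well
```

The gate only decides which tier performs recognition; it never produces a recognition result itself.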
  • In this embodiment, when the similarity is high enough, the search is performed in the local database according to the sample features; when the similarity between the sample feature and every sample feature in the sample library is lower than the preset value, the sample feature is sent directly to the own cloud server for matching, improving the speed of voice recognition while ensuring its accuracy.
  • the present invention also provides a system for voice recognition.
  • a first embodiment of a system for voice recognition according to the present invention is provided.
  • The system for voice recognition includes a terminal 11. The terminal includes:
  • an obtaining module 10 configured to acquire an input voice;
  • a feature extraction module 20 configured to extract sample features of the input voice;
  • a voice recognition module 30 configured to identify the input voice according to the sample feature and a preset local database, where the local database includes a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  • the system for voice recognition is for identifying an input voice in the case of voice interaction.
  • In voice interaction, the terminal generally receives the voice input by the user through its voice input device and processes the received voice, then either converts the input voice into text output or, after the input voice is recognized, controls the operation of the terminal through control commands.
  • the system for voice recognition includes a terminal, and the terminal can be understood as a carrier for receiving voice input, and the terminal can be various devices with voice interaction functions such as a mobile phone, a tablet, a smart TV, a smart air conditioner, and an intelligent robot.
  • the input voice is a voice input by a user during a voice interaction.
  • After acquiring the input voice, the feature extraction module 20 processes the voice. Specifically, the acquired input voice exists in the form of sound data; spectrum analysis is performed on the sound data, and the extracted sample features are stored in the terminal. Spectral analysis means that the signal is Fourier transformed to obtain its amplitude spectrum and phase spectrum; there are many methods for spectrum analysis, which can be selected as needed.
  • Feature extraction of the input speech further analyzes the speech. Methods for feature extraction are prior art, may be selected as needed, and are not described here.
  • the speech recognition module 30 After extracting the sample features of the input speech, the speech recognition module 30 identifies the input speech based on the sample features and the preset local database.
  • The preset local database is a database present on the terminal; even without the terminal being networked, the database can be accessed directly to obtain its information. The data saved in the local database can be understood as a secondary cache for storing sound data.
  • Deep learning mainly uses structures similar to artificial neural networks. Artificial neural networks have a hierarchical, progressive structure in which high-level representations are composed of low-level representations, building up levels from shallow to deep. The essence of deep learning is to learn important features by constructing various learning models and massive training data, so as to improve prediction accuracy.
  • various sounds are collected, and the collected sounds are extracted. The collected sounds are used as training sets, and the training set is used to improve the prediction accuracy of the model through continuous learning.
  • the training process is to optimize the model. The process of weighting. After the sample has been trained to optimize the model, the input sound is input to the model, and an output value is obtained, which is a predicted value for identifying the input sound.
• sparse coding represents a signal as a linear combination of a set of bases, requiring only a few bases for the representation.
• for example, 20 basic sound structures can be found among various disordered sounds, and other sounds can be synthesized from these 20 structures.
• the left side represents the 20 basic sound structures, and the right side represents a certain sound synthesized from them; the target synthesized sound is determined by the weight assigned to each of the 20 basic sounds during composition.
• a sample set with different pitch, timbre, and volume characteristics can thus be constructed by sparse coding; the sample set is then trained with a preset training algorithm to optimize the network weights of the speech recognition model.
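The sparse-coding idea above — a target sound expressed as a weighted combination of a few of 20 basic sound structures — can be sketched as follows. The orthonormal bases and the greedy matching-pursuit decomposition are illustrative assumptions, not the patent's specified algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 hypothetical "basic sound structures", made orthonormal here so the
# greedy decomposition below recovers the composition weights exactly
n_bases, length = 20, 64
Q, _ = np.linalg.qr(rng.standard_normal((length, n_bases)))
bases = Q.T                                   # rows: unit-norm, mutually orthogonal

# A target sound synthesized from only 3 of the 20 bases
true_weights = np.zeros(n_bases)
true_weights[[2, 7, 15]] = [1.5, -0.8, 0.6]
target = true_weights @ bases

def matching_pursuit(signal, bases, n_iter=3):
    """Greedy sparse coding: at each step pick the basis that best
    explains the residual and record its weight."""
    residual = signal.copy()
    weights = np.zeros(len(bases))
    for _ in range(n_iter):
        corr = bases @ residual               # correlation with every basis
        k = int(np.argmax(np.abs(corr)))      # most explanatory basis
        weights[k] += corr[k]
        residual = residual - corr[k] * bases[k]
    return weights

w = matching_pursuit(target, bases)
print(np.flatnonzero(np.abs(w) > 1e-9))       # indices of the bases actually used
```

Only the three bases used in synthesis come back with nonzero weight, matching the idea that "only a few bases are needed to be represented".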
• as for the preset training algorithm, there are many commonly used deep-learning-based training algorithms, which can be selected as needed.
• the set of speech recognition output values obtained from the speech recognition model refers to the set of output values produced by passing previously input speech, after feature extraction, through the speech recognition model; these output values represent recognition results for the input speech and can be retrieved by search at recognition time.
• in the embodiment of the present invention, a terminal acquires an input voice; the terminal extracts sample features of the input voice; and the terminal identifies the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and the output values obtained from it, recognition of the input speech through the local database is more accurate, achieving the aim of improving the accuracy of speech recognition during voice interaction.
  • the speech recognition module 30 includes:
  • Search sub-module 310 configured to search, in the set of speech recognition output values, whether there is a speech recognition output value corresponding to the sample feature;
• the first identification sub-module 320 is configured to acquire the speech recognition output value when the speech recognition output value corresponding to the sample feature is found;
  • the second identification sub-module 330 is configured to perform voice recognition according to the detected network signal strength of the terminal when the voice recognition output value corresponding to the sample feature is not searched.
• the speech recognition output value set contains correspondences between sample features of speech and speech recognition output values, the output values having been obtained from the speech recognition model.
• when recognition is performed, the sample feature is searched for in the set of speech recognition output values; a preset search engine is used to determine whether an output value corresponding to the sample feature exists.
• when the output value corresponding to the sample feature is found, the first identification sub-module 320 acquires it; this output value is the recognition result for the segment of input speech. The terminal then outputs the recognition result and may also perform a corresponding operation based on it. For example, if the voice interaction is meant to issue certain instructions to an intelligent robot, the robot is controlled to perform the corresponding operation; if the voice interaction is meant to retrieve some content in a browser, the corresponding retrieval is executed according to the recognized result and the retrieval result is displayed on the user terminal.
• when no output value is found, the second recognition sub-module 330 performs voice recognition according to the detected network signal strength of the terminal.
• many methods for detecting the terminal's network signal strength exist in the prior art; one can be selected as needed and is not described again here.
• the purpose of judging the terminal's network signal strength is to determine the current network environment; the next step proceeds according to its condition.
• the terminal searches the speech recognition output value set for an output value corresponding to the sample feature, and when one is found it is acquired, improving the accuracy of recognition.
• when no output value corresponding to the sample feature is found, speech recognition proceeds according to the strength of the terminal's network signal. This avoids attempting to send the sample feature to other servers, or waiting for their connection responses, when the network signal is weak, thereby improving the speed of speech recognition.
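The lookup-then-fallback flow described above might be sketched as below. All names and the threshold value are hypothetical, since the patent leaves the preset value and the data structures unspecified:

```python
FIRST_PRESET = 50   # hypothetical first preset value for network signal strength

def recognize(sample_feature, output_value_set, signal_strength,
              local_model, own_cloud):
    """Search the local output-value set first; on a miss, route on the
    terminal's network signal strength (all names are illustrative)."""
    if sample_feature in output_value_set:        # level-2 cache hit
        return output_value_set[sample_feature]
    if signal_strength > FIRST_PRESET:            # strong signal: own cloud server
        return own_cloud(sample_feature)
    return local_model(sample_feature)            # weak signal: local model predicts

# Toy stand-ins for the stored output values, the local model, and the cloud
output_values = {"feat-hello": "hello"}
print(recognize("feat-hello", output_values, 10, lambda f: "model", lambda f: "cloud"))  # hello
print(recognize("feat-new", output_values, 80, lambda f: "model", lambda f: "cloud"))    # cloud
print(recognize("feat-new", output_values, 10, lambda f: "model", lambda f: "cloud"))    # model
```

The three calls exercise the three branches: cache hit, strong-signal cloud fallback, and weak-signal local prediction.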
• FIG. 10 is a schematic diagram of the refined functional modules of the second identification sub-module 330 in the embodiment shown in FIG. 9; the system further includes an own cloud server 12 corresponding to the terminal.
  • the second identification submodule 330 includes:
  • the determining unit 331 is configured to determine whether the network signal strength is greater than a first preset value
  • the first identifying unit 332 is configured to: when the network signal strength is greater than the first preset value, send the sample feature to the own cloud server, and perform voice recognition by using the own cloud server;
  • the second identifying unit 333 is configured to input the sample feature to the voice recognition model when the network signal strength is less than the first preset value, and output the predicted recognition result.
• the functional modules provided in this embodiment perform voice recognition according to network signal strength. Specifically, the determining unit 331 determines whether the network signal strength is greater than a preset value; the preset value may be set as required and may be fixed or variable.
• when the network signal strength is greater than the first preset value, the first identifying unit 332 sends the sample feature to the terminal's own cloud server, which performs the voice recognition.
• the own cloud server refers to the terminal's network-side cloud server; the data held on it can be understood as a level-1 cache.
• when the network signal strength is less than the first preset value, the second identifying unit 333 obtains a predicted recognition result from the speech recognition model in the local database. It can be understood that if the model's predictions include both a predicted value and a confidence, the result with the highest confidence output by the model is confirmed as the speech recognition result.
• the terminal determines whether the network signal strength is greater than the first preset value; if so, it sends the sample feature to its own cloud server and searches there. When the network signal is poor, it relies on the prediction output by the speech recognition model, avoiding the corresponding delay while preserving recognition accuracy. Combined with searching the local database first and the own cloud server second, this also improves the speed of speech recognition.
  • the self-owned cloud server 12 includes:
• the searching module 201 is configured to perform a search in the own cloud server after the sample feature is received by the own cloud server;
  • the identification module 202 is configured to acquire the recognition result when the identification result corresponding to the sample feature is searched in the own cloud server;
• the sending module 203 is configured to, when no recognition result corresponding to the sample feature is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample feature to a third-party voice server, which recognizes the input voice.
• after the own cloud server receives the sample feature, a search is performed within it; when a recognition result corresponding to the sample feature is found, the recognition module acquires that result.
• the own cloud server may also store a deep-learning-based speech recognition model more complex than the one in the local database, together with the output values obtained from that model; being deployed in the cloud, it usually has multiple distributed cache servers and more computing power.
• the speech recognition output values stored in the local database can be those used most frequently, with slightly less frequently used values stored in the own cloud server. Understandably, the data in both stores is continuously updated with use, making the speech recognition process more accurate and faster.
• after recognition, the recognition result can be saved to the own cloud server and/or the local database, and deep learning is performed on it, so that the predictions of the speech recognition model become more accurate as the number of uses grows.
• when no recognition result corresponding to the sample feature is found and the network strength is greater than the second preset value, the own cloud server sends the sample feature to the third-party voice server.
• the second preset value may be set as needed; it may be the same as or different from the first preset value, because the network speeds required to access the own cloud server and the third-party server may differ.
• the third-party voice server is a server with stronger voice recognition capability.
• the third-party voice server may be a server provided by a vendor offering voice recognition services, such as a voice recognition cloud server provided by the University of Science and Technology.
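A minimal sketch of the own-cloud tier described above — level-1 cache lookup, then the third-party voice server when the network strength clears the second preset value — with all names and values assumed for illustration:

```python
SECOND_PRESET = 60   # hypothetical second preset value (may differ from the first)

def cloud_recognize(sample_feature, cloud_cache, signal_strength, third_party):
    """Own-cloud (level-1 cache) lookup, then the third-party voice server
    when the signal clears the second preset value; otherwise no result,
    so the caller never waits on a weak link."""
    if sample_feature in cloud_cache:             # hit in the own cloud server
        return cloud_cache[sample_feature]
    if signal_strength > SECOND_PRESET:
        result = third_party(sample_feature)      # stronger external recognizer
        cloud_cache[sample_feature] = result      # save back for future lookups
        return result
    return None

cache = {}
first = cloud_recognize("feat-play", cache, 90, lambda f: "play music")
second = cloud_recognize("feat-play", cache, 10, lambda f: "play music")  # now cached
print(first, second)
```

The save-back step mirrors the description's point that recognition results are stored so the caches keep improving with use.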
• classification analysis can also be performed according to user characteristics such as age and identity, and a database established accordingly, to make recognition results more accurate.
  • the architecture of the own cloud server is as shown in Figure 7.
• deploying CDN servers reduces the difference in access speed between regions.
• the CDN server is also responsible for returning data found in its cache; users access the CDN server first.
• after reaching the reverse proxy server, requests are sent to the application servers through the load-balancing server.
• the load-balancing server accommodates concurrent access by large numbers of users, implements data offloading, and improves stability. A local cache can also be added on the application server to respond quickly with recognition results based on historical identifications.
• a database server can also be set up to store the accounts and settings of a large number of users.
• an interface connects to the third-party voice recognition server, combining that server's recognition capability to improve the accuracy of recognition and the user experience.
• in summary, after the own cloud server receives the sample feature, a search is performed within it; when a recognition result corresponding to the sample feature is found, it is acquired. When no result is found, the own cloud server sends the sample feature to the third-party voice server, and recognition by that server improves the accuracy of speech recognition. Sample features are sent to the third-party voice server only when the network environment is good, which reduces response delay during recognition.
  • the terminal 11 in the voice recognition system of the present invention further includes:
• a comparison analysis module, configured to compare the sample features with a preset sample library;
• a first triggering module, configured to trigger the search sub-module to search the local database according to the sample feature when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value;
• the first identifying unit, further configured to send the sample feature to the own cloud server, which performs voice recognition, when the similarity between the sample feature and the preset sample features in the sample library is lower than the preset value.
• the comparison analysis module compares the sample features with the preset sample library; the purpose is to determine whether to search directly in the local database or to send the sample feature directly to the own cloud server.
• the preset sample library can be configured as needed. Specifically, the sample features are matched against the sample features in the library; "preset samples" refers to the samples preset in the sample library.
• when the similarity is higher than the preset value, the first trigger module triggers the search sub-module 310 to identify the input voice according to the sample feature and the local database preset in the terminal; here a matching preset sample is a sample in the library whose similarity to the sample feature is greater than the preset value.
• when the similarity between the sample feature and every sample feature in the library is lower than the preset value, the terminal sends the sample feature to its own cloud server, which performs the voice recognition; a match below the preset value means the sample feature matches no entry in the sample library to the required degree.
• the specific preset value may be set as needed, for example 80%. When the similarity between the sample feature and a preset sample feature in the library is higher than 80%, the search sub-module 310 is triggered to identify the input voice according to the sample feature and the local database preset in the terminal; otherwise, the terminal sends the sample feature to its own cloud server, which performs the speech recognition.
• in this embodiment, the sample features are compared with a preset sample library. When the similarity between the sample feature and a preset sample feature is higher than the preset value, the search is performed in the local database according to the sample feature; when the similarity with every sample feature in the library is lower than the preset value, the sample feature is sent directly to the own cloud server for matching. This improves the speed of speech recognition while ensuring its accuracy.
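The similarity pre-check described above can be sketched as follows. Cosine similarity and the 80 % threshold are used for illustration only (the description names 80 % as an example but does not fix the similarity measure):

```python
import numpy as np

SIMILARITY_PRESET = 0.8   # e.g. 80 %, as in the description above

def cosine_similarity(a, b):
    """Similarity of two feature vectors in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(sample_feature, sample_library, search_local, search_cloud):
    """Compare the feature against every preset sample; search the local
    database only when some sample is similar enough, otherwise go
    straight to the own cloud server."""
    best = max(cosine_similarity(sample_feature, s) for s in sample_library)
    if best > SIMILARITY_PRESET:
        return search_local(sample_feature)
    return search_cloud(sample_feature)

library = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(route(np.array([0.9, 0.1]), library, lambda f: "local", lambda f: "cloud"))
print(route(np.array([0.7, 0.7]), library, lambda f: "local", lambda f: "cloud"))
```

The first feature closely resembles a library sample and is routed locally; the second resembles no single sample strongly enough and goes to the cloud.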

Abstract

A voice recognition method and system, the method comprising: a terminal obtains an inputted voice (S10); the terminal extracts sample characteristics of the inputted voice (S20); the terminal recognizes the inputted voice according to the sample characteristics and a preset local database, the local database comprising a deep learning-based voice recognition model and a voice recognition output value set obtained according to the voice recognition model (S30). The accuracy of voice recognition during a voice interaction process is improved.

Description

Method and system for speech recognition

Technical field

The present invention relates to the field of voice interaction, and in particular to a method and system for voice recognition.

Background

With the development of Internet information technology, intelligent hardware such as smart TVs, smart bracelets, and intelligent robots is used more and more widely. To make information convenient to obtain, most intelligent-hardware manufacturers provide voice interaction as a human-computer interaction method. During voice interaction, the intelligent hardware acquires the voice information input by the user and then, through voice recognition, outputs corresponding information or executes corresponding instructions. When speech recognition is inaccurate, the intelligent hardware cannot output the correct information or execute the correct instructions, degrading the user experience; improving the accuracy of speech recognition during voice interaction is therefore an urgent problem to be solved.
Summary of the invention

The main object of the present invention is to provide a method and system for speech recognition, aimed at improving the accuracy of speech recognition during voice interaction.

To achieve the above object, the voice recognition method provided by the present invention includes the following steps:

the terminal acquires an input voice;

the terminal extracts sample features of the input voice;

the terminal identifies the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.

Preferably, the set of speech recognition output values contains correspondences between sample features of speech and speech recognition output values;

the terminal identifying the input voice according to the sample features and the preset local database includes:

the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample feature;

when a speech recognition output value corresponding to the sample feature is found, the terminal acquiring that output value;

when no speech recognition output value corresponding to the sample feature is found, the terminal performing voice recognition according to the detected network signal strength of the terminal.

Preferably, the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample feature includes:

the terminal determining whether the network signal strength is greater than a first preset value;

if so, the terminal sending the sample feature to the terminal's own cloud server, which performs voice recognition;

if not, the terminal inputting the sample feature into the speech recognition model and outputting the predicted recognition result.

Preferably, the method further includes:

after the own cloud server receives the sample feature, performing a search in the own cloud server;

when a recognition result corresponding to the sample feature is found in the own cloud server, the own cloud server acquiring the recognition result;

when no recognition result corresponding to the sample feature is found in the own cloud server, if the cloud server detects that the network strength is greater than a second preset value, the cloud server sending the sample feature to a third-party voice server, the input voice being recognized by the third-party voice recognition server.

Preferably, before the terminal searches the speech recognition output values for a speech recognition output value corresponding to the sample feature, the method includes:

the terminal comparing the sample features with a preset sample library;

when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, the terminal performing the step of searching the local database according to the sample feature;

when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, the terminal performing the step of sending the sample feature to the terminal's own cloud server, which performs voice recognition.
In addition, to achieve the above object, the present invention further provides a voice recognition system, the system including a terminal;

the terminal includes:

an acquisition module, configured to acquire an input voice;

a feature extraction module, configured to extract sample features of the input voice;

a voice recognition module, configured to identify the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.

Preferably, the set of speech recognition output values contains correspondences between sample features of speech and speech recognition output values;

the speech recognition module includes:

a search sub-module, configured to search the set of speech recognition output values for a speech recognition output value corresponding to the sample feature;

a first identification sub-module, configured to acquire the speech recognition output value when one corresponding to the sample feature is found;

a second identification sub-module, configured to perform voice recognition according to the detected network signal strength of the terminal when no speech recognition output value corresponding to the sample feature is found.

Preferably, the system further includes an own cloud server corresponding to the terminal;

the second identification sub-module includes:

a determining unit, configured to determine whether the network signal strength is greater than a first preset value;

a first identifying unit, configured to send the sample feature to the own cloud server, which performs voice recognition, when the network signal strength is greater than the first preset value;

a second identifying unit, configured to input the sample feature into the speech recognition model and output the predicted recognition result when the network signal strength is less than the first preset value.

Preferably, the own cloud server includes:

a search module, configured to perform a search in the own cloud server after the own cloud server receives the sample feature;

a recognition module, configured to acquire the recognition result when a recognition result corresponding to the sample feature is found in the own cloud server;

a sending module, configured to, when no recognition result corresponding to the sample feature is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample feature to a third-party voice server, the input voice being recognized by the third-party voice recognition server.

Preferably, the terminal further includes:

a comparison analysis module, configured to compare the sample features with a preset sample library;

a first triggering module, configured to trigger the search sub-module to search the local database according to the sample feature when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value;

the first identifying unit, further configured to send the sample feature to the own cloud server, which performs voice recognition, when the similarity between the sample feature and the preset sample features in the sample library is lower than the preset value.

In the embodiment of the present invention, a terminal acquires an input voice; the terminal extracts sample features of the input voice; and the terminal identifies the input voice according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and the output values obtained from it, recognition of the input speech through the local database is more accurate, achieving the aim of improving the accuracy of speech recognition during voice interaction.
Brief description of the drawings

FIG. 1 is a schematic flowchart of the steps of a first embodiment of the voice recognition method of the present invention;

FIG. 2 is a schematic diagram of obtaining a certain sound from basic sound structures in the embodiment shown in FIG. 1;

FIG. 3 is a schematic diagram of representing a target sound by sparse coding in the embodiment shown in FIG. 1;

FIG. 4 is a schematic flowchart of the refinement of step S30 in the embodiment shown in FIG. 1;

FIG. 5 is a schematic flowchart of the refinement of performing voice recognition according to the detected network signal strength of the terminal in step S330 of the embodiment shown in FIG. 4;

FIG. 6 is a schematic flowchart of the refinement of performing voice recognition in the own cloud server of the present invention;

FIG. 7 is a schematic diagram of the architecture of the own cloud server in the embodiment shown in FIG. 6;

FIG. 8 is a schematic diagram of the functional modules of a first embodiment of the voice recognition system of the present invention;

FIG. 9 is a schematic diagram of the refined functional modules of the speech recognition module 30 in the embodiment shown in FIG. 8;

FIG. 10 is a schematic diagram of the refined functional modules of the second identification sub-module 330 in the embodiment shown in FIG. 9;

FIG. 11 is a schematic diagram of the functional modules included in the own cloud server 12 of the present invention.

The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description

It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
本发明提供一种语音识别的方法。参照图1,在第一实施例中,该方法包括:The present invention provides a method of speech recognition. Referring to FIG. 1, in the first embodiment, the method includes:
步骤S10,终端获取输入语音;Step S10: The terminal acquires an input voice.
步骤S20,所述终端提取所述输入语音的样本特征;Step S20, the terminal extracts sample features of the input voice;
步骤S30,所述终端根据所述样本特征和预置的本地数据库识别所述输入语音,所述本地数据库包含基于深度学习的语音识别模型和根据所述语音识别模型得到的语音识别输出值集合。Step S30, the terminal identifies the input voice according to the sample feature and a preset local database, where the local database includes a depth learning based speech recognition model and a speech recognition output value set obtained according to the speech recognition model.
本发明提供的语音识别的方法,用于在语音交互的情况下,对输入的语音进行识别。在语音交互时,一般需要在终端通过终端的声音输入设备接收用户输入的语音,然后对接收到的声音进行处理,再将输入的声音转化为文字输出,或者是将输入的声音识别后,通过控制指令控制终端的运行。终端可以理解为接收声音输入的载体,终端可以为手机、平板、智能电视、智能空调、智能机器人等各种具备语音交互功能的设备。The method for voice recognition provided by the present invention is for identifying an input voice in the case of voice interaction. In the voice interaction, it is generally required that the terminal receives the voice input by the user through the voice input device of the terminal, and then processes the received voice, and then converts the input voice into a text output, or after the input voice is recognized, Control commands control the operation of the terminal. The terminal can be understood as a carrier for receiving sound input, and the terminal can be various devices with voice interaction functions such as a mobile phone, a tablet, a smart TV, a smart air conditioner, and an intelligent robot.
本实施例中上述输入语音是语音交互过程中,用户输入的语音。当终端获取到输入的语音后,对语音进行处理,具体的,获取到的输入语音会以声音数据的形式存在,然后将声音数据进行频谱分析,再 提取样本特征存入终端。频谱分析是指,对信号进行傅里叶变换,得到其振幅谱与相位谱,具体的频谱分析的方法有很多,可以根据需要进行选择。对输入语音进行特征提取是为了进一步将语音进行分析,具体进行特征提取的方法属于现有技术,这里不再赘述,可以根据需要进行选择提取语音的方法。In the embodiment, the input voice is a voice input by a user during a voice interaction. After the terminal obtains the input voice, the voice is processed. Specifically, the obtained input voice exists in the form of voice data, and then the voice data is analyzed by spectrum, and then The extracted sample features are stored in the terminal. Spectral analysis means that the signal is Fourier transformed to obtain its amplitude spectrum and phase spectrum. There are many methods for spectrum analysis, which can be selected according to needs. The feature extraction of the input speech is to further analyze the speech. The method for feature extraction is a prior art, and is not described here. The method for selecting the speech may be selected as needed.
After the sample features of the input speech are extracted, the input speech is recognized according to the sample features and the preset local database. The preset local database resides on the terminal and can be accessed directly, without a network connection, to obtain its information; the data stored in the local database can be understood as a second-level cache of sound data.
The local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Deep learning mainly exploits structures similar to artificial neural networks: an artificial neural network is a hierarchical system in which the levels are progressive and high-level representations are composed of combinations of low-level ones, so the hierarchy is built up from shallow to deep. The essence of deep learning is to learn important features by constructing learning models and massive amounts of training data, thereby improving the accuracy of judgments. During deep learning, various sounds are collected and their features extracted; the collected sounds serve as a training set, which is used to improve the model's prediction accuracy through continuous learning. Training is the process of optimizing the model's weights. After the model has been trained and optimized on the samples, feeding an input sound to the model yields an output value, which is the predicted recognition of that sound.
When building the speech recognition model, a sparse coding algorithm can be used. Sparse coding represents a signal as a linear combination of a set of basis elements, requiring only a few bases to express the signal. Prior research indicates that 20 basic sound structures can be identified in various unordered sounds, and other sounds can be synthesized from these 20. As shown in Fig. 2, the left side shows the 20 basic sound structures and the right side shows a sound segment synthesized from them; the target synthesized sound is determined by the weights assigned to the 20 basic sounds during synthesis. With sparse coding, the feature representation of a sound can be written as Target = SUM(a[k] * S[k]), where a[k] is the weight coefficient applied when superimposing element S[k], and S[k] is one of the basic sound structures. Fig. 3 shows a schematic of Target = SUM(a[k] * S[k]) representing a target sound by sparse coding: x is the sound at a certain time point, 0.9 is a weight coefficient a[k], and φ36 is one of the basic sound structures S[k]. Sparse coding makes it possible to construct sample sets with varying pitch, timbre, and volume characteristics; the sample set is then trained with a preset training algorithm to optimize the network weights of the speech recognition model. Many deep-learning-based training algorithms are in common use and can be selected as needed.
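The combination Target = SUM(a[k] * S[k]) can be demonstrated directly. In this sketch the 20 basic sound structures are stood in by random vectors (real bases would be learned waveforms); only the sparse linear combination itself follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 20 basic sound structures S[k]
# (in the patent these are learned basis waveforms, not noise).
n_bases, frame_len = 20, 64
S = rng.standard_normal((n_bases, frame_len))

# Sparse weight vector a[k]: only a few bases are active.
a = np.zeros(n_bases)
a[[3, 7, 12]] = [0.9, -0.4, 0.2]

# Target = SUM(a[k] * S[k]) -- the linear combination from the text.
target = a @ S
```

Because most entries of a[k] are zero, the target is expressed with only three of the twenty bases, which is the "few bases suffice" property sparse coding relies on.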
The set of speech recognition output values obtained from the speech recognition model refers to the collection of output values produced by passing a number of pre-entered speech samples, via feature extraction, through the model. These output values represent the recognition results of the corresponding inputs and can be retrieved by search during recognition.
In the embodiment of the present invention, the terminal acquires input speech; the terminal extracts sample features of the input speech; and the terminal recognizes the input speech according to the sample features and a preset local database, where the local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and its output values, recognizing input speech with this local database yields more accurate results, thereby achieving the aim of improving recognition accuracy during voice interaction.
Referring to Fig. 4, a schematic flowchart of the refinement of step S30, step S30 includes:
Step S310: the terminal searches the set of speech recognition output values for an output value corresponding to the sample feature; if found, step S320 is performed; if not, step S330 is performed.
Step S320: the terminal obtains the speech recognition output value.
Step S330: the terminal performs speech recognition according to the detected network signal strength of the terminal.
In this embodiment, because the sample features of a set of speech inputs have been passed through the speech recognition model to obtain corresponding output values, the set of speech recognition output values contains the correspondence between sample features and output values. During recognition, the sample feature is searched for in this set; specifically, a preset search engine is used to check whether an output value corresponding to the sample feature exists.
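The correspondence between sample features and output values can be pictured as a keyed lookup. A minimal sketch follows; the quantization scheme and the example feature vectors are assumptions for illustration, since the patent does not specify how features are indexed:

```python
from typing import Optional, Tuple

def feature_key(feature: Tuple[float, ...], step: float = 0.1) -> Tuple[int, ...]:
    """Quantize a feature vector so near-identical features share a key
    (a hypothetical indexing scheme, not specified in the patent)."""
    return tuple(round(x / step) for x in feature)

# The output-value set as a mapping: quantized sample feature -> result.
output_values = {
    feature_key((0.9, 0.3, 0.1)): "turn on the light",
}

def search_output_value(feature: Tuple[float, ...]) -> Optional[str]:
    """Step S310: return the stored output value, or None on a miss."""
    return output_values.get(feature_key(feature))
```

A miss (None) corresponds to the branch in which the terminal falls back to recognition based on network signal strength.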
When the speech recognition output value corresponding to the sample feature is found, that output value is obtained; it is the recognition result for the segment of input speech. The recognition result can then be output on the terminal, or a corresponding operation can be executed according to it: for example, if the voice interaction is intended to issue instructions to an intelligent robot, the robot is controlled to perform the corresponding operation; or, if the voice interaction is a content search in a browser, the corresponding retrieval is performed according to the recognized result and the search results are displayed on the user terminal.
When no speech recognition output value corresponding to the sample feature is found, speech recognition is performed according to the detected network signal strength of the terminal. Many methods for detecting the strength of the terminal's network signal exist in the prior art and can be selected as needed; they are not described here. The purpose of judging the network signal strength is to assess the current network environment, and the next step is taken according to whether that environment is good.
In this embodiment, the terminal searches the set of speech recognition output values for an output value corresponding to the sample feature; when one is found, it is obtained, which improves recognition accuracy. At the same time, when no corresponding output value is found, speech recognition is performed according to the strength of the terminal's network signal, avoiding attempts to send the sample feature to other servers, or waiting for other servers to return a connection response, when the network signal is weak; this improves the speed of speech recognition.
Referring to Fig. 5, a schematic flowchart of the refinement of performing speech recognition according to the detected network signal strength of the terminal in step S330, step S330 further includes:
Step S331: the terminal judges whether the network signal strength is greater than a first preset value; if so, step S332 is performed; if not, step S333 is performed.
Step S332: the terminal sends the sample feature to the terminal's own cloud server, and speech recognition is performed through that server.
Step S333: the terminal inputs the sample feature into the speech recognition model and outputs the predicted recognition result.
This embodiment refines the step in which the terminal performs speech recognition according to network signal strength. Specifically, the terminal judges whether the network signal strength is greater than a preset value; the preset value can be set as needed and may be either fixed or variable.
When the network signal strength is greater than the first preset value, the terminal's network environment is good, so the sample feature is sent to the terminal's own cloud server, which performs the recognition. The own cloud server is the terminal's network-side cloud server, and the data it holds can be understood as a first-level cache.
When the network signal strength is not greater than the first preset value, the network environment may be poor and sending the sample feature to the own cloud server might fail; therefore, the predicted recognition result is obtained directly from the speech recognition model in the local database. It can be understood that, if the model's output includes both predicted values and confidences, the result with the highest confidence output by the model can be taken as the speech recognition result at output time.
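Steps S331 to S333 amount to a small decision routine. In the sketch below, the numeric threshold and the local model's interface are illustrative assumptions, not values given in the patent:

```python
from typing import List, Tuple

FIRST_PRESET = 0.6  # hypothetical first preset value (normalized strength)

def local_predict(feature) -> List[Tuple[str, float]]:
    """Stand-in for the local deep-learning model: returns
    (predicted result, confidence) pairs."""
    return [("play music", 0.85), ("pause music", 0.10)]

def recognize(feature, signal_strength: float) -> str:
    if signal_strength > FIRST_PRESET:
        # Step S332: good network -- defer to the own cloud server.
        return "sent to own cloud server"
    # Step S333: weak network -- take the highest-confidence local prediction.
    return max(local_predict(feature), key=lambda rc: rc[1])[0]
```

The max-by-confidence selection in the weak-network branch mirrors the remark above that the highest-confidence model output is taken as the recognition result.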
In this embodiment, the terminal judges whether the network signal strength is greater than the first preset value; if so, it sends the sample feature to its own cloud server, where the search is performed. When the network signal is poor, the predicted recognition result is output by the speech recognition model, avoiding the corresponding delay while improving recognition accuracy. Combining a search in the local database first with a subsequent search on the own cloud server also improves the speed of speech recognition.
Referring to Fig. 6, a schematic flowchart of the refinement of performing speech recognition in the own cloud server; in this embodiment, the speech recognition method of the present invention further includes:
Step S101: after the own cloud server receives the sample feature, a search is performed in the own cloud server.
Step S102: when the recognition result corresponding to the sample feature is found in the own cloud server, the own cloud server obtains the recognition result.
Step S103: when the recognition result corresponding to the sample feature is not found in the own cloud server, and the cloud server detects that the network strength is greater than a second preset value, the cloud server sends the sample feature to a third-party speech server, and the input speech is recognized through that third-party speech recognition server.
This embodiment mainly describes the process of performing speech recognition in the own cloud server.
After the terminal sends the sample feature to the own cloud server, a search is performed there; when the recognition result corresponding to the sample feature is found, it is obtained.
The own cloud server can also store a more complex deep-learning-based speech recognition model than the local database, along with the output values obtained from that model, because the own cloud server is deployed in the cloud, usually with multiple distributed cache servers and stronger computing power. Meanwhile, the speech recognition outputs stored in the local database can be those used most frequently, with slightly less frequently used ones stored on the own cloud server. It can be understood that the data stored in the local database and on the own cloud server are continuously updated with use, making recognition more accurate and faster. Likewise, when a recognition result is obtained, it can be saved to the own cloud server and/or the local database, and deep learning can be performed on it, so that the model's predictions become more precise as usage increases.
When the recognition result corresponding to the sample feature is not found in the own cloud server, and the detected network strength is greater than the second preset value (that is, the terminal's network strength exceeds the second preset value), the cloud server sends the sample feature to a third-party speech server. The second preset value can be set as needed; it may be equal to or different from the first preset value, because the network speeds required to access the own cloud server and a third-party cloud server may differ. The third-party speech server is a server with stronger speech recognition capability, typically one provided by a vendor specializing in speech recognition services, such as the speech recognition cloud server provided by iFLYTEK.
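Steps S101 to S103 form a chain of fallbacks on the cloud side. In this sketch, the threshold value and the lookup callables are illustrative placeholders; only the control flow follows the text:

```python
from typing import Callable, Optional

SECOND_PRESET = 0.4  # hypothetical second preset value

def own_cloud_recognize(
    feature: object,
    cloud_lookup: Callable[[object], Optional[str]],
    third_party: Callable[[object], str],
    signal_strength: float,
) -> Optional[str]:
    """Return the own-cloud result when present; escalate to the
    third-party server only when the network strength allows it."""
    result = cloud_lookup(feature)       # S101: search the own cloud server
    if result is not None:
        return result                    # S102: cache hit
    if signal_strength > SECOND_PRESET:
        return third_party(feature)      # S103: escalate to the third party
    return None                          # no result available offline
```

Gating the third-party call on signal strength is what keeps the slow path from being attempted when the network is poor.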
It can also be understood that classification analysis can be performed according to characteristics of the user, such as age and identity, and a database built accordingly, making the recognition results more accurate.
In implementation, the architecture of the own cloud server is as shown in Fig. 7. CDN servers are deployed to reduce differences in access speed across regions, and the CDN servers are also responsible for returning data found in the cache. A user's request passes through a CDN server to a reverse proxy server and then, through a load-balancing server, to an application server; the load balancer can handle concurrent access by a large number of users, distributing the data flow and improving stability. A local cache can also be added on the application server to respond quickly to recognition results based on history. During voice interaction, a search engine works together with a non-relational database, and a database server can be set up to store the accounts and settings of a large number of users. At the same time, the system interfaces upward with the third-party speech recognition server, whose recognition capability is combined to improve accuracy and the user experience.
In this embodiment, after the own cloud server receives the sample feature, a search is performed in the own cloud server; when the recognition result corresponding to the sample feature is found there, it is obtained. When no recognition result is found and the terminal's network strength is greater than the second preset value, the cloud server sends the sample feature to a third-party speech server, improving recognition accuracy by performing recognition there. Moreover, the sample feature is sent to the third-party speech server only when the network environment is good, which helps avoid response delays in the recognition process.
In this embodiment, before step S310, the method includes:
the terminal compares the sample feature with a preset sample library;
when the similarity between the sample feature and a preset sample feature in the sample library is higher than a preset value, step S310 is performed;
when the similarity between the sample feature and every preset sample feature in the sample library is lower than the preset value, step S332 is performed.
In this embodiment, after the input speech is acquired and its sample features are extracted, the sample features are compared with the preset sample library. The purpose is to decide whether to search directly in the local database or to send the sample features directly to the own cloud server. The preset sample library can be configured in advance as needed. Specifically, the sample features are matched against the sample features in the library; the preset samples refer to the samples preset in the library.
When the similarity between the sample feature and a preset sample feature in the library is higher than the preset value, step S310 is performed, that is, the terminal recognizes the input speech according to the sample feature and the local database preset on the terminal; here, the preset sample refers to a sample in the library whose similarity to the sample feature is greater than the preset value. When the similarity between the sample feature and every preset sample feature in the library is lower than the preset value, step S332 is performed, that is, the terminal sends the sample feature to its own cloud server, which performs the recognition; here, "every sample feature match below the preset value" means that the degree of match between the sample feature and every sample in the library is below the preset value.
The specific preset value can be set as needed, for example 80%: when the similarity between the sample feature and a preset sample feature in the library is higher than 80%, step S310 is performed; when the similarity with every preset sample feature in the library is lower than 80%, step S332 is performed.
In this embodiment, the sample feature is compared with the preset sample library: when its similarity to a preset sample feature in the library is higher than the preset value, the search is performed in the local database according to the sample feature; when its similarity to every sample feature in the library is lower than the preset value, the sample feature is sent directly to the own cloud server for matching. This improves the speed of speech recognition while ensuring its accuracy.
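The similarity-gated routing described above can be sketched with cosine similarity. The similarity measure, the feature vectors, and the library contents are assumptions (the patent specifies only a similarity threshold, exemplified as 80%):

```python
import math
from typing import List, Sequence

SIM_THRESHOLD = 0.8  # the 80% example value from the text

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def route(feature: Sequence[float], library: List[Sequence[float]]) -> str:
    """Return 'S310' (search the local database) when any library sample
    is similar enough, otherwise 'S332' (send to the own cloud server)."""
    if any(cosine(feature, sample) > SIM_THRESHOLD for sample in library):
        return "S310"
    return "S332"
```

Only one library sample needs to clear the threshold to stay local; when all fall below it, the feature goes straight to the own cloud server.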
The present invention further provides a speech recognition system. Referring to Fig. 8, a first embodiment of the speech recognition system of the present invention is provided. In this embodiment, the system includes a terminal 11, and the terminal includes:
an acquisition module 10, configured to acquire input speech;
a feature extraction module 20, configured to extract sample features of the input speech;
a speech recognition module 30, configured to recognize the input speech according to the sample features and a preset local database, where the local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model.
The speech recognition system provided by the present invention is used to recognize input speech during voice interaction. In voice interaction, the terminal generally receives the user's speech through its sound input device, processes the received sound, and then either converts it into text output or, after recognizing it, controls the operation of the terminal through control instructions. Here the speech recognition system includes a terminal, which can be understood as any carrier that receives sound input, such as a mobile phone, tablet, smart TV, smart air conditioner, intelligent robot, or other device with a voice interaction function.
In this embodiment, the input speech is the speech entered by the user during voice interaction. After the acquisition module 10 obtains the input speech, the feature extraction module 20 processes the speech: the input speech exists in the form of sound data, which is subjected to spectral analysis, after which sample features are extracted and stored on the terminal. Spectral analysis refers to applying a Fourier transform to the signal to obtain its amplitude spectrum and phase spectrum; many spectral analysis methods exist and can be selected as needed. Feature extraction is performed on the input speech for further analysis; specific feature extraction methods belong to the prior art and are not described here, and a suitable method can be selected as required.
After the sample features of the input speech are extracted, the speech recognition module 30 recognizes the input speech according to the sample features and the preset local database. The preset local database resides on the terminal and can be accessed directly, without a network connection, to obtain its information; the data stored in the local database can be understood as a second-level cache of sound data.
The local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Deep learning mainly exploits structures similar to artificial neural networks: an artificial neural network is a hierarchical system in which the levels are progressive and high-level representations are composed of combinations of low-level ones, so the hierarchy is built up from shallow to deep. The essence of deep learning is to learn important features by constructing learning models and massive amounts of training data, thereby improving the accuracy of judgments. During deep learning, various sounds are collected and their features extracted; the collected sounds serve as a training set, which is used to improve the model's prediction accuracy through continuous learning. Training is the process of optimizing the model's weights. After the model has been trained and optimized on the samples, feeding an input sound to the model yields an output value, which is the predicted recognition of that sound.
When building the speech recognition model, a sparse coding algorithm can be used. Sparse coding represents a signal as a linear combination of a set of basis elements, requiring only a few bases to express the signal. Prior research indicates that 20 basic sound structures can be identified in various unordered sounds, and other sounds can be synthesized from these 20. As shown in Fig. 2, the left side shows the 20 basic sound structures and the right side shows a sound segment synthesized from them; the target synthesized sound is determined by the weights assigned to the 20 basic sounds during synthesis. With sparse coding, the feature representation of a sound can be written as Target = SUM(a[k] * S[k]), where a[k] is the weight coefficient applied when superimposing element S[k], and S[k] is one of the basic sound structures. Fig. 3 shows a schematic of Target = SUM(a[k] * S[k]) representing a target sound by sparse coding: x is the sound at a certain time point, 0.9 is a weight coefficient a[k], and φ36 is one of the basic sound structures S[k]. Sparse coding makes it possible to construct sample sets with varying pitch, timbre, and volume characteristics; the sample set is then trained with a preset training algorithm to optimize the network weights of the speech recognition model. Many deep-learning-based training algorithms are in common use and can be selected as needed.
The set of speech recognition output values obtained from the speech recognition model refers to the collection of output values produced by passing a number of pre-entered speech samples, via feature extraction, through the model. These output values represent the recognition results of the corresponding inputs and can be retrieved by search during recognition.
In the embodiment of the present invention, the terminal acquires input speech; the terminal extracts sample features of the input speech; and the terminal recognizes the input speech according to the sample features and a preset local database, where the local database contains a deep-learning-based speech recognition model and a set of speech recognition output values obtained from that model. Because the local database contains both the model and its output values, recognizing input speech with this local database yields more accurate results, thereby achieving the aim of improving recognition accuracy during voice interaction.
Referring to Fig. 9, a schematic diagram of the refined functional modules of the speech recognition module 30 in the embodiment shown in Fig. 8, the speech recognition module 30 includes:
a search submodule 310, configured to search the set of speech recognition output values for an output value corresponding to the sample feature;
a first recognition submodule 320, configured to obtain the speech recognition output value when an output value corresponding to the sample feature is found;
a second recognition submodule 330, configured to perform speech recognition according to the detected network signal strength of the terminal when no output value corresponding to the sample feature is found.
In this embodiment, because the sample features of a set of speech samples are passed through the speech recognition model to obtain corresponding output values, the set of speech recognition output values contains the correspondence between sample features and output values. During recognition, the sample features are searched for in the set of speech recognition output values; specifically, a preset search engine is used to search for a speech recognition output value corresponding to the sample features.
When the search sub-module 310 finds the speech recognition output value corresponding to the sample features, the first recognition sub-module 320 acquires that value, which is the recognition result of the input speech segment. The terminal may then output the recognition result, or perform a corresponding operation according to it. For example, if the voice interaction controls an intelligent robot with certain instructions, the robot is controlled to perform the corresponding operation; if the voice interaction retrieves content in a browser, the corresponding retrieval is performed according to the recognized result and the retrieval results are displayed on the user terminal.
When no speech recognition output value corresponding to the sample features is found, the second recognition sub-module 330 performs speech recognition according to the detected network signal strength of the terminal. Many methods of detecting the network signal strength of a terminal exist in the prior art and may be selected as needed; they are not described again here. The purpose of evaluating the network signal strength is to judge the current network environment, and the next operation is chosen according to whether that environment is good.
In this embodiment, the terminal searches the set of speech recognition output values for a value corresponding to the sample features; when such a value is found, it is acquired, which improves recognition accuracy. When no such value is found, speech recognition is performed according to the strength of the terminal's network signal, which avoids attempting to send the sample features to another server, or waiting for another server to respond to a connection request, when the signal is weak, thereby improving recognition speed.
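The local-first flow of sub-modules 310, 320, and 330 can be sketched as below. The lookup table, threshold value, and helper names are illustrative assumptions, not the patent's implementation.

```python
FIRST_PRESET = 0.6  # hypothetical threshold on normalized signal strength

def recognize(sample_feature, output_values, signal_strength, local_model, cloud):
    """Local-first recognition: search the output-value set, then fall back
    according to network signal strength (sub-modules 310/320/330)."""
    # Search sub-module 310: look up the feature in the output-value set.
    result = output_values.get(sample_feature)
    if result is not None:
        return result                      # first recognition sub-module 320
    # Second recognition sub-module 330: decide by network signal strength.
    if signal_strength > FIRST_PRESET:
        return cloud(sample_feature)       # send to own cloud server
    return local_model(sample_feature)     # predict with the local model

# Toy stand-ins for the local model and the cloud service.
table = {"feat-hello": "hello"}
local = lambda f: f"local-guess({f})"
cloud = lambda f: f"cloud-result({f})"

print(recognize("feat-hello", table, 0.2, local, cloud))  # found locally
print(recognize("feat-bye", table, 0.9, local, cloud))    # strong signal -> cloud
print(recognize("feat-bye", table, 0.3, local, cloud))    # weak signal -> local model
```

Note how the network is consulted only after the local search misses, which is exactly what keeps the weak-signal path free of server round-trips.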
Referring to FIG. 10, which is a schematic diagram of the refined functional modules of the second recognition sub-module 330 in the embodiment shown in FIG. 9, the system further includes an own cloud server 12 corresponding to the terminal;
the second recognition sub-module 330 includes:
a determining unit 331, configured to determine whether the network signal strength is greater than a first preset value;
a first recognition unit 332, configured to send the sample features to the own cloud server, which performs speech recognition, when the network signal strength is greater than the first preset value;
a second recognition unit 333, configured to input the sample features into the speech recognition model and output a predicted recognition result when the network signal strength is less than the first preset value.
The functional modules provided in this embodiment perform speech recognition according to the network signal strength. Specifically, the determining unit 331 determines whether the network signal strength is greater than a preset value; the preset value may be set as needed and may be a fixed value or a variable one.
When the network signal strength is greater than the first preset value, the network environment of the terminal is good, and the first recognition unit 332 sends the sample features to the terminal's own cloud server, which performs the speech recognition. The own cloud server refers to the terminal's network-side cloud server, and the data stored in it can be understood as a first-level cache.
When the network signal strength is not greater than the first preset value, the network environment may be poor and sending the sample features to the own cloud server may fail. Therefore, the second recognition unit 333 obtains a predicted recognition result from the speech recognition model in the local database. It can be understood that, if the prediction output by the speech recognition model includes a predicted value and a confidence, the result with the highest confidence can be taken as the speech recognition result at output time.
In this embodiment, the terminal determines whether the network signal strength is greater than the first preset value; if so, it sends the sample features to its own cloud server for retrieval there, and when the network signal is poor, it outputs the predicted recognition result of the speech recognition model, avoiding the corresponding delay while improving recognition accuracy over the network. Searching the local database first and then the own cloud server also improves the speed of speech recognition.
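When the local model returns (predicted value, confidence) pairs, the highest-confidence selection performed by the second recognition unit 333 can be sketched as follows; the list of pairs and its values are hypothetical.

```python
def pick_prediction(predictions):
    """Return the (value, confidence) pair with the highest confidence.

    `predictions` is a hypothetical list of (value, confidence) pairs
    as might be produced by the local speech recognition model.
    """
    if not predictions:
        raise ValueError("model returned no predictions")
    # max over confidence, the second element of each pair
    return max(predictions, key=lambda p: p[1])

preds = [("hello", 0.72), ("hollow", 0.18), ("halo", 0.10)]
best, conf = pick_prediction(preds)
print(best, conf)  # hello 0.72
```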
Referring to FIG. 11, which is a schematic diagram of the functional modules included in the own cloud server 12, in this embodiment the own cloud server 12 includes:
a search module 201, configured to search in the own cloud server after the own cloud server receives the sample features;
a recognition module 202, configured to acquire the recognition result when a recognition result corresponding to the sample features is found in the own cloud server;
a sending module 203, configured to, when no recognition result corresponding to the sample features is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample features to a third-party speech server, the third-party speech recognition server recognizing the input speech.
In this embodiment, after the terminal sends the sample features to the own cloud server, a search is performed there; when a recognition result corresponding to the sample features is found, the recognition module 202 acquires it.
The own cloud server may also store a deep-learning-based speech recognition model more complex than the one in the local database, together with the output values obtained from that model, because the own cloud server is deployed in the cloud, usually with multiple distributed cache servers and stronger computing power. Meanwhile, the speech recognition output results stored in the local database can be those used most frequently, with less frequently used results stored in the own cloud server. It can be understood that the data stored in the local database and in the own cloud server is continuously updated with use, making the speech recognition process more accurate and faster. At the same time, when a speech recognition result is obtained, it can be saved to the own cloud server and/or the local database, and deep learning can be performed on it, so that the predictions of the speech recognition model become more accurate as the number of uses increases.
When no recognition result corresponding to the sample features is found in the own cloud server, and the detected network strength, that is, the terminal's network strength, is greater than a second preset value, the cloud server sends the sample features to a third-party speech server. The second preset value may be set as needed; it may be the same as the first preset value or different from it, because the network speeds required to access the own cloud server and a third-party cloud server may differ. The third-party speech server is a server with stronger speech recognition capability, typically one provided by a vendor specializing in speech recognition services, such as the speech recognition cloud server provided by iFLYTEK.
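The own-cloud stage described for modules 201–203 can be sketched as a second tier of the same cache-then-escalate pattern. The cache contents, threshold, and third-party stub here are illustrative assumptions.

```python
SECOND_PRESET = 0.8  # hypothetical; may differ from the first preset value

def cloud_recognize(sample_feature, cloud_cache, signal_strength, third_party):
    """Own-cloud stage: search the cloud cache, then escalate to a
    third-party speech server only when the signal clears SECOND_PRESET."""
    result = cloud_cache.get(sample_feature)   # search module 201
    if result is not None:
        return result                          # recognition module 202
    if signal_strength > SECOND_PRESET:        # sending module 203
        return third_party(sample_feature)
    return None  # caller falls back to the local model's prediction

cache = {"feat-weather": "what's the weather"}
third = lambda f: f"third-party({f})"

print(cloud_recognize("feat-weather", cache, 0.5, third))  # cloud cache hit
print(cloud_recognize("feat-new", cache, 0.9, third))      # escalate to third party
print(cloud_recognize("feat-new", cache, 0.5, third))      # None: weak signal
```

Using a separate, possibly stricter threshold for the third-party hop reflects the description's point that reaching an external server may demand a better network than reaching the own cloud.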
It can be understood that classification analysis may also be performed according to user characteristics such as age and identity, and a database established accordingly, making recognition results more accurate.
In implementation, the architecture of the own cloud server is as shown in FIG. 7. A CDN server is deployed to reduce differences in access speed between regions, and the CDN server also returns data found in its cache. User requests pass through the CDN server to a reverse proxy server, and then through a load-balancing server to the application servers; the load balancer accommodates concurrent access by a large number of users, distributing traffic and improving stability. A local cache may also be added on the application servers to respond quickly with recognition results based on recognition history. Voice interaction is completed by a search engine working with a non-relational database, and a database server may be set up to store the accounts and settings of a large number of users. The system also interfaces upward with a third-party speech recognition server, combining that server's recognition capability to improve recognition accuracy and user experience.
In this embodiment, after the own cloud server receives the sample features, a search is performed in the own cloud server; when a recognition result corresponding to the sample features is found there, it is acquired. When no recognition result is found and the terminal's network strength is greater than the second preset value, the cloud server sends the sample features to a third-party speech server, and recognition by the third-party speech server improves accuracy. Moreover, sending the sample features to the third-party speech server only when the network environment is good avoids response delays in the speech recognition process.
In this embodiment, the terminal 11 in the speech recognition system proposed by the present invention further includes:
a comparison and analysis module, configured to compare the sample features with a preset sample library;
a first trigger module, configured to trigger the search sub-module to search the local database according to the sample features when the similarity between the sample features and a preset sample feature in the sample library is higher than a preset value;
the first recognition unit being further configured to send the sample features to the own cloud server, which performs speech recognition, when the similarity between the sample features and the preset sample features in the sample library is lower than the preset value.
In this embodiment, after the input speech is acquired and its sample features are extracted, the comparison and analysis module compares the sample features with a preset sample library; the purpose is to decide whether to search directly in the local database or to send the sample features directly to the own cloud server for searching. The preset sample library may be set in advance as needed. Specifically, the sample features are matched against the sample features in the sample library; the preset samples are the samples preset in the sample library.
When the similarity between the sample features and a preset sample feature in the sample library is higher than the preset value, the first trigger module triggers the search sub-module 310 to recognize the input speech according to the sample features and the local database preset in the terminal; here, a preset sample is a sample in the sample library whose similarity to the sample features exceeds the preset value. When the similarity between the sample features and every preset sample feature in the sample library is lower than the preset value, that is, when the match between the sample features and every sample in the library falls below the preset value, the terminal sends the sample features to its own cloud server, which performs the speech recognition.
The specific preset value may be set as needed, for example to 80%: when the similarity between the sample features and a preset sample feature in the sample library is higher than 80%, the search sub-module 310 is triggered to recognize the input speech according to the sample features and the local database preset in the terminal; when the similarity is lower than 80%, the terminal sends the sample features to its own cloud server, which performs the speech recognition.
In this embodiment, the sample features are compared with a preset sample library: when the similarity to a preset sample feature is higher than the preset value, the local database is searched according to the sample features; when the similarity to every sample feature in the library is lower than the preset value, the sample features are sent directly to the own cloud server for matching, improving the speed of speech recognition while maintaining its accuracy.
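The similarity gate with the 80% example threshold can be sketched as below. The patent does not specify how similarity is computed, so cosine similarity over toy feature vectors is used here purely as an illustrative assumption.

```python
SIMILARITY_PRESET = 0.80  # the 80% example threshold from the description

def similarity(a, b):
    """Toy similarity between two feature vectors (cosine); the actual
    comparison method is not specified by the patent."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def route(sample_feature, sample_library):
    """Return 'local' if any library sample matches above the preset,
    otherwise 'cloud' (send straight to the own cloud server)."""
    best = max((similarity(sample_feature, s) for s in sample_library), default=0.0)
    return "local" if best > SIMILARITY_PRESET else "cloud"

library = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(route([0.9, 0.1, 0.0], library))   # close to a library sample -> local
print(route([0.0, 0.0, 1.0], library))   # matches no sample -> cloud
```

Only the best-matching library sample matters: one match above the threshold routes to the local database, while falling below it for every sample routes to the cloud.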
The above are only preferred embodiments of the present invention and are not intended to limit the patent scope of the present invention; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

  1. A speech recognition method, characterized in that the method comprises the following steps:
    a terminal acquires input speech;
    the terminal extracts sample features of the input speech;
    the terminal recognizes the input speech according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  2. The method according to claim 1, characterized in that the set of speech recognition output values contains a correspondence between sample features of speech and speech recognition output values;
    the terminal recognizing the input speech according to the sample features and the preset local database comprises:
    the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample features;
    when a speech recognition output value corresponding to the sample features is found, the terminal acquiring the speech recognition output value;
    when no speech recognition output value corresponding to the sample features is found, the terminal performing speech recognition according to the detected network signal strength of the terminal.
  3. The method according to claim 2, characterized in that the terminal searching the set of speech recognition output values for a speech recognition output value corresponding to the sample features comprises:
    the terminal determining whether the network signal strength is greater than a first preset value;
    if so, the terminal sending the sample features to the terminal's own cloud server, which performs speech recognition;
    if not, the terminal inputting the sample features into the speech recognition model and outputting a predicted recognition result.
  4. The method according to claim 3, characterized in that the method further comprises:
    after the own cloud server receives the sample features, performing a search in the own cloud server;
    when a recognition result corresponding to the sample features is found in the own cloud server, the own cloud server acquiring the recognition result;
    when no recognition result corresponding to the sample features is found in the own cloud server, if the cloud server detects that the network strength is greater than a second preset value, the cloud server sending the sample features to a third-party speech server, the third-party speech recognition server recognizing the input speech.
  5. The method according to claim 4, characterized in that, before the terminal searches the speech recognition output values for a speech recognition output value corresponding to the sample features, the method comprises:
    the terminal comparing the sample features with a preset sample library;
    when the similarity between the sample features and a preset sample feature in the sample library is higher than a preset value, the terminal performing the step of searching the local database according to the sample features;
    when the similarity between the sample features and every preset sample feature in the sample library is lower than the preset value, the terminal performing the step of sending the sample features to the terminal's own cloud server, which performs speech recognition.
  6. A speech recognition system, characterized in that the system comprises: a terminal;
    the terminal comprising:
    an acquisition module, configured to acquire input speech;
    a feature extraction module, configured to extract sample features of the input speech;
    a speech recognition module, configured to recognize the input speech according to the sample features and a preset local database, the local database containing a deep-learning-based speech recognition model and a set of speech recognition output values obtained from the speech recognition model.
  7. The system according to claim 6, characterized in that the set of speech recognition output values contains a correspondence between sample features of speech and speech recognition output values;
    the speech recognition module comprising:
    a search sub-module, configured to search the set of speech recognition output values for a speech recognition output value corresponding to the sample features;
    a first recognition sub-module, configured to acquire the speech recognition output value when a speech recognition output value corresponding to the sample features is found;
    a second recognition sub-module, configured to perform speech recognition according to the detected network signal strength of the terminal when no speech recognition output value corresponding to the sample features is found.
  8. The system according to claim 7, characterized in that the system further comprises an own cloud server corresponding to the terminal;
    the second recognition sub-module comprising:
    a determining unit, configured to determine whether the network signal strength is greater than a first preset value;
    a first recognition unit, configured to send the sample features to the own cloud server, which performs speech recognition, when the network signal strength is greater than the first preset value;
    a second recognition unit, configured to input the sample features into the speech recognition model and output a predicted recognition result when the network signal strength is less than the first preset value.
  9. The system according to claim 8, characterized in that the own cloud server comprises:
    a search module, configured to perform a search in the own cloud server after the own cloud server receives the sample features;
    a recognition module, configured to acquire the recognition result when a recognition result corresponding to the sample features is found in the own cloud server;
    a sending module, configured to, when no recognition result corresponding to the sample features is found in the own cloud server and the detected network strength is greater than a second preset value, send the sample features to a third-party speech server, the third-party speech recognition server recognizing the input speech.
  10. The system according to claim 9, characterized in that the terminal further comprises:
    a comparison and analysis module, configured to compare the sample features with a preset sample library;
    a first trigger module, configured to trigger the search sub-module to search the local database according to the sample features when the similarity between the sample features and a preset sample feature in the sample library is higher than a preset value;
    the first recognition unit being further configured to send the sample features to the own cloud server, which performs speech recognition, when the similarity between the sample features and the preset sample features in the sample library is lower than the preset value.
PCT/CN2017/083065 2016-05-30 2017-05-04 Voice recognition method and system WO2017206661A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610370685.8A CN105931633A (en) 2016-05-30 2016-05-30 Speech recognition method and system
CN201610370685.8 2016-05-30

Publications (1)

Publication Number Publication Date
WO2017206661A1 true WO2017206661A1 (en) 2017-12-07

Family

ID=56841485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/083065 WO2017206661A1 (en) 2016-05-30 2017-05-04 Voice recognition method and system

Country Status (2)

Country Link
CN (1) CN105931633A (en)
WO (1) WO2017206661A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096620A (en) * 2021-03-24 2021-07-09 妙音音乐科技(武汉)有限公司 Musical instrument tone color identification method, system, equipment and storage medium

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN105931633A (en) * 2016-05-30 2016-09-07 深圳市鼎盛智能科技有限公司 Speech recognition method and system
CN106453859B (en) * 2016-09-23 2019-11-15 维沃移动通信有限公司 A kind of sound control method and mobile terminal
CN106898350A (en) * 2017-01-16 2017-06-27 华南理工大学 A kind of interaction of intelligent industrial robot voice and control method based on deep learning
US10574777B2 (en) * 2017-06-06 2020-02-25 International Business Machines Corporation Edge caching for cognitive applications
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN107742516B (en) * 2017-09-29 2020-11-17 上海望潮数据科技有限公司 Intelligent recognition method, robot and computer readable storage medium
CN108039174A (en) * 2018-01-08 2018-05-15 珠海格力电器股份有限公司 Speech recognition system, method and apparatus
CN109005451B (en) * 2018-06-29 2021-07-30 杭州星犀科技有限公司 Video strip splitting method based on deep learning
CN110839051B (en) * 2018-08-16 2022-07-01 科沃斯商用机器人有限公司 Service providing method, device, robot and storage medium
CN109377997B (en) * 2018-12-10 2021-06-01 珠海格力电器股份有限公司 Voice control method and device for household appliance, storage medium and household appliance system
CN109605373A (en) * 2018-12-21 2019-04-12 重庆大学 Voice interactive method based on robot
CN111625362A (en) * 2020-05-29 2020-09-04 浪潮电子信息产业股份有限公司 Computing resource scheduling method and device and related components
CN113948085B (en) * 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (10)

Publication number Priority date Publication date Assignee Title
US20080120108A1 (en) * 2006-11-16 2008-05-22 Frank Kao-Ping Soong Multi-space distribution for pattern recognition based on mixed continuous and discrete observations
CN102708865A (en) * 2012-04-25 2012-10-03 北京车音网科技有限公司 Method, device and system for voice recognition
CN103295575A (en) * 2012-02-27 2013-09-11 北京三星通信技术研究有限公司 Speech recognition method and client
CN103488401A (en) * 2013-09-30 2014-01-01 乐视致新电子科技(天津)有限公司 Voice assistant activating method and device
CN103489444A (en) * 2013-09-30 2014-01-01 乐视致新电子科技(天津)有限公司 Speech recognition method and device
CN103956168A (en) * 2014-03-29 2014-07-30 深圳创维数字技术股份有限公司 Voice recognition method and device, and terminal
CN104575503A (en) * 2015-01-16 2015-04-29 广东美的制冷设备有限公司 Speech recognition method and device
CN104715752A (en) * 2015-04-09 2015-06-17 刘文军 Voice recognition method, voice recognition device and voice recognition system
CN105118508A (en) * 2015-09-14 2015-12-02 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN105931633A (en) * 2016-05-30 2016-09-07 深圳市鼎盛智能科技有限公司 Speech recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972253B2 (en) * 2010-09-15 2015-03-03 Microsoft Technology Licensing, Llc Deep belief network for large vocabulary continuous speech recognition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096620A (en) * 2021-03-24 2021-07-09 妙音音乐科技(武汉)有限公司 Musical instrument tone color identification method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN105931633A (en) 2016-09-07

Similar Documents

Publication Publication Date Title
WO2017206661A1 (en) Voice recognition method and system
CN111104495B (en) Information interaction method, device, equipment and storage medium based on intention recognition
US10657966B2 (en) Better resolution when referencing to concepts
JP6440732B2 (en) Automatic task classification based on machine learning
WO2020182122A1 (en) Text matching model generation method and device
US9990923B2 (en) Automated software execution using intelligent speech recognition
CN108255934B (en) Voice control method and device
WO2020087974A1 (en) Model generation method and device
CN110083693B (en) Robot dialogue reply method and device
US20160189715A1 (en) Speech recognition device and method
CN111428010B (en) Man-machine intelligent question-answering method and device
US11127399B2 (en) Method and apparatus for pushing information
CN109271533A (en) Multimedia document retrieval method
US20140207716A1 (en) Natural language processing method and system
CN110795532A (en) Voice information processing method and device, intelligent terminal and storage medium
CN113806588B (en) Method and device for searching video
CN108682415B (en) Voice search method, device and system
US20150104065A1 (en) Apparatus and method for recognizing object in image
CN108572746B (en) Method, apparatus and computer readable storage medium for locating mobile device
CN108536680B (en) Method and device for acquiring house property information
CN114239805A (en) Cross-modal retrieval neural network, training method and device, electronic equipment and medium
KR101801250B1 (en) Method and system for automatically tagging themes suited for songs
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN108694939B (en) Voice search optimization method, device and system
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17805606

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 17805606

Country of ref document: EP

Kind code of ref document: A1