WO2019015435A1 - Speech recognition method, apparatus, and storage medium - Google Patents

Speech recognition method, apparatus, and storage medium

Info

Publication number
WO2019015435A1
WO2019015435A1 (application PCT/CN2018/091926)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
wake
fuzzy
speech recognition
word
Prior art date
Application number
PCT/CN2018/091926
Other languages
English (en)
French (fr)
Inventor
唐惠忠
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2020502569A priority Critical patent/JP6949195B2/ja
Priority to KR1020207004025A priority patent/KR102354275B1/ko
Publication of WO2019015435A1 publication Critical patent/WO2019015435A1/zh
Priority to US16/743,150 priority patent/US11244672B2/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/083Recognition networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206Monitoring of events, devices or parameters that trigger a change in power modality
    • G06F1/3215Monitoring of peripheral devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206Monitoring of events, devices or parameters that trigger a change in power modality
    • G06F1/3231Monitoring the presence, absence or movement of users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3293Power saving characterised by the action undertaken by switching to a less power-consuming processor, e.g. sub-CPU
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1815Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/72409User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality by interfacing with external accessories
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/02Power saving arrangements
    • H04W52/0209Power saving arrangements in terminal devices
    • H04W52/0261Power saving arrangements in terminal devices managing power supply demand, e.g. depending on battery level
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221Announcement of recognition results
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • the present invention relates to the field of communication technologies, and in particular, to speech recognition.
  • With the development of artificial intelligence, intelligent hardware products have also developed rapidly.
  • Intelligent hardware products refer to hardware devices integrated with artificial intelligence functions, such as smart mobile terminals (referred to as mobile terminals).
  • The core of intelligent hardware products is inseparable from interaction with people, and voice interaction, being natural and easy to learn, has become the mainstream interaction technology of intelligent hardware products.
  • To support voice interaction, the recording function of the terminal generally needs to stay on so that the central processing unit (CPU, Central Processing Unit) can process audio data at any time; the CPU cannot sleep even while the user is not speaking. Since the CPU must encode, decode, play, and otherwise handle various kinds of audio data, this scheme places high requirements on the CPU, and the power consumption of the whole system is very large; for a battery-powered mobile terminal, this greatly shortens its standby time.
  • CPU Central Processing Unit
  • To address this, the prior art has proposed using an external power supply, or waking the device with a physical button. However, an external power supply inevitably affects the terminal's mobility, and waking by physical button means voice wake-up cannot be realized. That is, in the existing solutions, maintaining both mobility and the voice wake-up function requires consuming a large amount of battery power, which greatly reduces the standby time of the mobile terminal and degrades its performance.
  • The embodiments of the present invention provide a voice recognition method, apparatus, and storage medium that can reduce system power consumption, prolonging the standby time of the mobile terminal and improving its performance while maintaining mobility and the voice wake-up function.
  • an embodiment of the present invention provides a voice recognition method, including:
  • acquiring audio data; performing fuzzy speech recognition on the audio data by a digital signal processor (DSP, Digital Signal Processor); and, when the fuzzy speech recognition result indicates that a wake-up word exists, waking up, by the DSP, a CPU in a sleep state, the CPU being used for semantic analysis of the audio data.
  • the performing fuzzy speech recognition on the audio data by using a digital signal processor includes:
  • performing speech recognition on the audio data by fuzzy clustering analysis to obtain a fuzzy speech recognition result.
  • the digital signal processor performs speech recognition on the audio data by using a fuzzy clustering analysis to obtain a fuzzy speech recognition result, including:
  • establishing a fuzzy clustering neural network according to fuzzy clustering analysis; using the fuzzy clustering neural network as an estimator of a probability density function to predict the probability that the audio data includes a wake-up word; generating, if the prediction indicates a probability greater than or equal to a set value, a fuzzy speech recognition result indicating that the wake-up word exists; and generating, if the prediction indicates a probability less than the set value, a fuzzy speech recognition result indicating that the wake-up word does not exist.
  • the performing fuzzy speech recognition on the audio data by using a digital signal processor includes:
  • performing speech recognition on the audio data by using a fuzzy matching algorithm to obtain a fuzzy speech recognition result.
  • the performing, by the digital signal processor, speech recognition on the audio data by using a fuzzy matching algorithm to obtain a fuzzy speech recognition result includes:
  • obtaining a feature map of the wake-up word's pronunciation as a standard feature map; analyzing the feature map of each word's pronunciation in the audio data to obtain feature maps to be matched; calculating, according to a preset membership function, the degree value with which each feature map to be matched belongs to the standard feature map; generating, if the degree value is greater than or equal to a preset value, a fuzzy speech recognition result indicating that the wake-up word exists; and generating otherwise a fuzzy speech recognition result indicating that the wake-up word does not exist.
  • the method further includes:
  • the audio data is semantically analyzed by the central processing unit, and a corresponding operation is performed according to the analysis result.
  • Optionally, before the semantic analysis of the audio data by the central processing unit, the method further includes:
  • reading the data of the wake-up word in the audio data from the digital signal processor to obtain wake-up data; performing voice recognition on the wake-up data by the central processing unit; performing, when the voice recognition result indicates that the wake-up word exists, the step of semantically analyzing the audio data; and setting, when the voice recognition result indicates that there is no wake-up word, the central processing unit to sleep and returning to the step of acquiring audio data.
  • the performing, by the central processing unit, voice recognition on the wake-up data includes:
  • setting the working state of the central processing unit to a first state, the first state being single-core and low-frequency, and performing voice recognition on the wake-up data in the first state.
  • the semantic analysis of the audio data by the central processing unit includes:
  • setting the working state of the central processing unit to a second state, the second state being multi-core and high-frequency, and semantically analyzing the audio data in the second state.
  • Alternatively, the semantic analysis of the audio data by the central processing unit includes:
  • determining a semantic scenario according to the wake-up word corresponding to the audio data; determining the number of working cores and the dominant frequency of the central processing unit according to the semantic scenario; and setting the working state of the central processing unit accordingly to obtain a third state, in which the audio data is semantically analyzed.
  • Optionally, before performing the fuzzy speech recognition on the audio data by the digital signal processor, the method further includes:
  • Noise reduction and/or echo cancellation processing is performed on the audio data.
  • the performing the corresponding operation according to the analysis result includes:
  • determining an operation object and operation content according to the analysis result, and performing the operation content on the operation object.
  • an embodiment of the present invention provides a voice recognition apparatus, including:
  • an acquisition unit configured to acquire audio data;
  • a fuzzy identification unit configured to perform fuzzy speech recognition on the audio data by using a DSP; and
  • a wake-up unit configured to wake up a CPU in a sleep state when the fuzzy speech recognition result indicates that the wake-up word exists, and the CPU is configured to perform semantic analysis on the audio data.
  • the fuzzy identification unit is specifically configured to perform voice recognition on the audio data by using a fuzzy clustering analysis by using a DSP to obtain a fuzzy speech recognition result.
  • the fuzzy identification unit may be specifically configured to: establish a fuzzy clustering neural network according to fuzzy clustering analysis; use the fuzzy clustering neural network as an estimator of a probability density function to predict the probability that the audio data includes a wake-up word; generate, if the prediction indicates a probability greater than or equal to a set value, a fuzzy speech recognition result indicating that the wake-up word exists; and generate, if the prediction indicates a probability less than the set value, a fuzzy speech recognition result indicating that the wake-up word does not exist.
  • the fuzzy identification unit is specifically configured to perform voice recognition on the audio data by using a fuzzy matching algorithm by using a DSP to obtain a fuzzy speech recognition result.
  • the fuzzy identification unit may be specifically configured to: obtain a feature map of the wake-up word's pronunciation as a standard feature map; analyze the feature map of each word's pronunciation in the audio data to obtain feature maps to be matched; calculate, according to a preset membership function, the degree value with which each feature map to be matched belongs to the standard feature map; generate, if the degree value is greater than or equal to a preset value, a fuzzy speech recognition result indicating that the wake-up word exists; and generate, if the degree value is less than the preset value, a fuzzy speech recognition result indicating that the wake-up word does not exist.
  • the voice recognition apparatus may further include a processing unit, configured to perform semantic analysis on the audio data by using a CPU, and perform a corresponding operation according to the analysis result.
  • the speech recognition apparatus may further include a precise identification unit as follows:
  • the precise identification unit is configured to: read the data of the wake-up word in the audio data from the DSP to obtain wake-up data; perform voice recognition on the wake-up data by the CPU; trigger, when the voice recognition result indicates that the wake-up word exists, the processing unit to perform the operation of semantically analyzing the audio data by the CPU; and set, when the voice recognition result indicates that there is no wake-up word, the CPU to sleep and trigger the acquisition unit to perform the operation of acquiring audio data.
  • the precise identification unit may be configured to set the working state of the CPU to a first state, the first state being single-core and low-frequency, and to perform voice recognition on the wake-up data in the first state.
  • the processing unit may be configured to set the working state of the CPU to a second state, the second state being multi-core and high-frequency, and to semantically analyze the audio data in the second state.
  • the processing unit may be specifically configured to: determine a semantic scenario according to the wake-up word corresponding to the audio data; determine the number of working cores and the dominant frequency of the CPU according to the semantic scenario; and set the working state of the CPU according to the number of working cores and the dominant frequency to obtain a third state, in which the audio data is semantically analyzed.
  • the voice recognition device may further include a filtering unit, as follows:
  • the filtering unit is configured to perform noise reduction and/or echo cancellation processing on the audio data.
  • an embodiment of the present invention further provides a mobile terminal including a storage medium and a processor, the storage medium storing a plurality of instructions and the processor being configured to load and execute the instructions;
  • the instructions are used to implement the steps in any of the speech recognition methods provided by the embodiments of the present invention.
  • an embodiment of the present invention further provides a storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the steps in any of the voice recognition methods provided by the embodiments of the present invention.
  • In the embodiments of the present invention, fuzzy speech recognition may be performed on the audio data by the DSP; when the fuzzy speech recognition result indicates that a wake-up word exists, the DSP wakes up the CPU in the sleep state, and the CPU can then perform semantic analysis on the audio data.
  • Because the scheme uses a DSP with lower running power consumption, instead of the higher-power CPU, to monitor the audio data, the CPU does not need to stay awake all the time but can remain dormant and be woken only when needed.
  • The solution can therefore greatly reduce system power consumption while preserving mobility and the voice wake-up function, thus extending the standby time of the mobile terminal and improving its performance.
  • FIG. 1a is a structural diagram of a mobile terminal according to an embodiment of the present invention.
  • FIG. 1b is a schematic diagram of a scenario of a voice recognition method according to an embodiment of the present invention.
  • FIG. 1c is a flowchart of a voice recognition method according to an embodiment of the present invention.
  • FIG. 1d is a block diagram of a voice recognition method according to an embodiment of the present invention.
  • FIG. 2a is another flowchart of a voice recognition method according to an embodiment of the present invention.
  • FIG. 2b is another block diagram of a voice recognition method according to an embodiment of the present invention.
  • FIG. 3a is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention.
  • FIG. 3b is another schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention.
  • FIG. 3c is another schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a mobile terminal according to an embodiment of the present invention.
  • Embodiments of the present invention provide a voice recognition method, apparatus, and storage medium.
  • the voice recognition device may be specifically integrated in a mobile terminal, such as a mobile phone, a wearable smart device, a tablet computer, and/or a laptop computer.
  • a DSP may be set in the mobile terminal.
  • The DSP may be set in a codec (coder-decoder), i.e., a DSP-capable codec, so that when the mobile terminal acquires audio data, such as a sound received from a user via a microphone (MIC, Microphone), the audio data can be subjected to fuzzy speech recognition by the DSP.
  • If the fuzzy speech recognition result indicates that a wake-up word exists, the DSP wakes up the sleeping CPU, and the CPU can be used for semantic analysis of the audio data, for example, see FIG. 1b; otherwise, if the fuzzy speech recognition result indicates that there is no wake-up word, the DSP does not wake the CPU but continues to listen to the audio data.
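The two-processor flow above can be sketched as a simple monitoring loop. This is a minimal illustration, not the patent's implementation: the names `Cpu`, `dsp_fuzzy_recognize`, and `monitor`, and the substring check standing in for fuzzy recognition, are all assumptions for demonstration.

```python
class Cpu:
    """Stands in for the high-power CPU that sleeps until the DSP wakes it."""
    def __init__(self):
        self.asleep = True

    def wake(self):
        self.asleep = False

    def semantic_analysis(self, audio):
        # Placeholder for the full semantic analysis performed after wake-up.
        return "analysed:" + audio


def dsp_fuzzy_recognize(audio, wake_words=("calling", "sending information")):
    """Low-power fuzzy check: does the audio appear to contain a wake-up word?"""
    return any(w in audio for w in wake_words)


def monitor(audio_frames, cpu):
    """DSP-side loop: wake the CPU only when a wake-up word is detected."""
    results = []
    for frame in audio_frames:
        if dsp_fuzzy_recognize(frame):
            cpu.wake()
            results.append(cpu.semantic_analysis(frame))
        # otherwise the CPU stays asleep and the DSP keeps listening
    return results
```

The key design point is that `semantic_analysis` is only ever reached through `wake`, mirroring how the CPU remains dormant until the DSP's fuzzy check fires.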
  • A DSP is a kind of microprocessor especially suited to digital signal processing operations; it can implement various digital signal processing algorithms quickly and in real time. Because of hardware features such as low-overhead or zero-overhead loops and jumps, its power consumption is lower than that of other processors; in addition, the DSP also has a noise-reduction function.
  • This embodiment is described from the perspective of a voice recognition apparatus, which may be integrated in a device such as a mobile terminal, including a mobile phone, a wearable smart device, a tablet computer, and/or a laptop computer, among other equipment.
  • This embodiment provides a voice recognition method including: acquiring audio data; performing fuzzy voice recognition on the audio data by a DSP; and, when the fuzzy speech recognition result indicates that a wake-up word exists, waking up, by the DSP, the CPU in a sleep state, the CPU being used for semantic analysis of the audio data.
  • the specific process of the voice recognition method can be as follows:
  • For example, the audio data can be collected by a MIC, such as a MIC module built into the mobile terminal.
  • The audio data may include data converted from various forms of sound; the type of sound is not limited and may be, for example, a human voice, an animal sound, the sound of an object, and/or music.
  • There may be multiple ways of performing fuzzy speech recognition; for example, fuzzy cluster analysis may be used to perform speech recognition on the audio data, or a fuzzy matching algorithm may be used, as follows:
  • the step "fuzzy speech recognition of the audio data by the DSP" may be as follows:
  • Fuzzy clustering analysis is used by the DSP to perform speech recognition on the audio data to obtain the fuzzy speech recognition result.
  • For example, a fuzzy clustering neural network may be established according to fuzzy clustering analysis and used as an estimator of the probability density function to predict the probability that the audio data includes the wake-up word. If the prediction indicates a probability greater than or equal to the set value, a fuzzy speech recognition result indicating that the wake-up word exists is generated; if the prediction indicates a probability less than the set value, a fuzzy speech recognition result indicating that the wake-up word does not exist is generated.
  • Fuzzy clustering analysis generally refers to constructing a fuzzy matrix according to the attributes of the research objects themselves and determining the clustering relationship according to a certain degree of membership, that is, using fuzzy mathematics to quantify the fuzzy relations between samples so as to cluster objectively and accurately. Clustering divides a data set into multiple classes or clusters such that the difference between data in different classes is as large as possible and the difference between data within a class is as small as possible.
  • the set value can be set according to the requirements of the actual application, and details are not described herein again.
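As a hedged illustration of the clustering-based decision, the sketch below uses a fuzzy c-means style membership on a scalar feature as a toy stand-in for the patent's fuzzy clustering neural network; the cluster centers, the fuzzifier `m`, and the set value 0.8 are all assumptions.

```python
def fuzzy_memberships(x, centers, m=2.0):
    """Fuzzy c-means style membership of a scalar feature in each cluster."""
    d = [abs(x - c) or 1e-9 for c in centers]  # avoid division by zero
    memberships = []
    for i in range(len(centers)):
        s = sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(len(centers)))
        memberships.append(1.0 / s)
    return memberships


def fuzzy_cluster_decision(x, wake_center, other_center, set_value=0.8):
    """Treat membership in the wake-word cluster as the predicted probability,
    then compare against the set value as the text describes."""
    p_wake, _ = fuzzy_memberships(x, [wake_center, other_center])
    return "wake-word-present" if p_wake >= set_value else "wake-word-absent"
```

A feature close to the wake-word cluster center yields a membership near 1 and crosses the set value; an ambiguous feature midway between centers does not.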
  • A fuzzy matching algorithm is used by the DSP to perform speech recognition on the audio data to obtain the fuzzy speech recognition result.
  • For example, a feature map of the wake-up word's pronunciation can be obtained as a standard feature map, and the feature map of each word in the audio data can be analyzed to obtain feature maps to be matched. Then, according to a preset membership function, the degree value with which each feature map to be matched belongs to the standard feature map is calculated. If the degree value is greater than or equal to the preset value, a fuzzy speech recognition result indicating that the wake-up word exists is generated; otherwise, if the degree value is less than the preset value, a fuzzy speech recognition result indicating that the wake-up word does not exist is generated.
  • the membership function and the preset value may be set according to the requirements of the actual application, and are not described herein again.
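The matching step above can be sketched as follows. The membership function (an inverse normalised L1 distance) and the preset value 0.7 are illustrative assumptions; the patent deliberately leaves both to the implementer.

```python
def membership_degree(candidate, standard):
    """Degree (0..1) to which a candidate feature map matches the standard one.

    An inverse normalised L1 distance is used purely for illustration.
    """
    dist = sum(abs(a - b) for a, b in zip(candidate, standard)) / len(standard)
    return max(0.0, 1.0 - dist)


def fuzzy_match(word_feature_maps, standard_map, preset_value=0.7):
    """Report the wake-up word as present if any word's degree >= preset."""
    return any(membership_degree(m, standard_map) >= preset_value
               for m in word_feature_maps)
```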
  • Optionally, to improve recognition accuracy, before the fuzzy speech recognition the audio data may also be subjected to filtering processing such as noise reduction and/or echo cancellation, as shown in FIG. 1d.
  • the voice recognition method may further include:
  • the audio data is subjected to noise reduction and/or echo cancellation processing to obtain processed audio data.
  • In this case, the step of performing fuzzy speech recognition on the audio data by the DSP may be: performing fuzzy speech recognition on the processed audio data by the DSP.
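As a minimal stand-in for the noise-reduction step, the sketch below smooths a sample sequence with a moving average; real codec DSPs use far richer filters (spectral subtraction, adaptive echo cancellers), so this is only a shape-of-the-pipeline illustration.

```python
def moving_average(samples, window=3):
    """Crude noise reduction: replace each sample with a local average,
    shrinking the window at the edges of the sequence."""
    out = []
    n = len(samples)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out
```

The smoothed output would then be handed to the DSP's fuzzy recognition step in place of the raw samples.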
  • The DSP wakes up the CPU in the sleep state; that is, the operating program of the CPU is activated by the DSP. For example, the running programs in the CPU related to recording and audio data may be activated.
  • There may be one or multiple wake-up words, and they may be preset according to actual application requirements. For example, taking wake-up words including "calling" and "sending information", when the fuzzy speech recognition result indicates that the phrase "calling" or "sending information" exists in the audio data, the CPU can be woken up by the DSP, and so on.
  • the voice recognition method may further include:
  • the audio data is semantically analyzed by the CPU, and corresponding operations are performed according to the analysis result.
  • For example, the operation object and the operation content may be determined according to the analysis result, and then the operation content is performed on the operation object.
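The object/content split above can be sketched as a tiny command parser. The verb list and the action table are assumptions for demonstration; the patent does not enumerate commands.

```python
def parse_analysis(text):
    """Split a recognised command into (operation content, operation object)."""
    # longest verb first so shorter verbs cannot shadow it
    for verb in ("send information to", "call", "search"):
        if text.startswith(verb):
            return verb, text[len(verb):].strip()
    return None, None


def execute(text, actions):
    """Perform the operation content on the operation object."""
    content, target = parse_analysis(text)
    if content is None:
        return "no-op"
    return actions[content](target)
```

For instance, "call Zhang San" splits into the content "call" and the object "Zhang San", and the registered "call" action is then applied to that object.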
  • Optionally, to save power, the woken CPU may further verify the audio data; that is, before the step of semantically analyzing the audio data by the CPU, the voice recognition method may further include:
  • reading the data of the wake-up word in the audio data from the DSP to obtain wake-up data, and performing voice recognition on the wake-up data by the CPU; when the voice recognition result indicates that the wake-up word exists, performing the step of semantically analyzing the audio data by the CPU; and when the voice recognition result indicates that there is no wake-up word, setting the CPU to sleep and returning to the step of acquiring audio data (i.e., step 101).
  • To reduce power consumption, when performing voice recognition on the wake-up data, the CPU need not enable all of its cores but may instead use a single core at a low frequency; that is, the step of performing voice recognition on the wake-up data by the CPU may include:
  • the working state of the CPU is set to a single core and a low frequency, so that the CPU performs voice recognition on the wakeup data in the working state.
  • the "single core and low frequency" working state is referred to as a first state, that is, the CPU can perform voice recognition on the wake data in the first state.
  • Subsequently, when the CPU needs to perform semantic analysis on the audio data, the number of working cores may be increased and the frequency raised; that is, the step of semantically analyzing the audio data by the CPU may include:
  • the working state of the CPU is set to multi-core and high frequency, and in the working state, the audio data is semantically analyzed by the CPU.
  • For convenience, this "multi-core and high-frequency" working state is referred to as a second state; that is, the working state of the CPU may be set to the second state, and the semantic analysis of the audio data is performed in the second state.
  • Multi-core refers to two or more complete computation engines (cores) integrated in the processor; low frequency means a dominant frequency lower than a preset frequency, and high frequency means a dominant frequency higher than or equal to the preset frequency.
  • the preset frequency may be determined according to the requirements of the actual application, and details are not described herein again.
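The two staged CPU states can be sketched as a small lookup. The concrete core counts and frequencies below are placeholder numbers; the patent only specifies "single core, low frequency" for wake-word verification and "multi-core, high frequency" for semantic analysis.

```python
PRESET_FREQ_MHZ = 1500  # assumed boundary between "low" and "high" frequency


def cpu_state_for(stage):
    """Map a processing stage to an assumed (core_count, frequency_mhz) pair."""
    if stage == "verify-wake-word":    # first state: single core, low frequency
        return (1, 800)
    if stage == "semantic-analysis":   # second state: multi-core, high frequency
        return (4, 2000)
    raise ValueError("unknown stage: " + stage)


def is_low_frequency(freq_mhz, preset=PRESET_FREQ_MHZ):
    """Low frequency is below the preset; high is at or above it."""
    return freq_mhz < preset
```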
  • Optionally, to use resources more rationally, the semantic analysis of the audio data may include:
  • determining a semantic scenario according to the wake-up word corresponding to the audio data; determining the number of working cores and the dominant frequency of the CPU according to the semantic scenario; and setting the working state of the CPU according to that number of cores and frequency to obtain a third state, in which the audio data is semantically analyzed.
  • For example, in a simpler semantic scenario the audio data can be semantically analyzed with a lower number of working cores and a lower dominant frequency, while in the semantic scenario of "searching", a higher number of working cores and a higher dominant frequency can be used.
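The scenario-dependent "third state" can be sketched as two lookup tables, one from wake-up word to semantic scenario and one from scenario to (core count, dominant frequency). Both tables and all numbers are illustrative assumptions.

```python
# wake-up word -> assumed semantic scenario
SCENARIO_OF_WAKE_WORD = {
    "calling": "dialling",
    "sending information": "messaging",
    "what is *": "searching",
}

# scenario -> assumed (core_count, frequency_mhz) "third state"
STATE_OF_SCENARIO = {
    "dialling": (2, 1200),   # lighter scenario: fewer cores, lower frequency
    "messaging": (2, 1200),
    "searching": (4, 2000),  # heavier scenario: more cores, higher frequency
}


def third_state(wake_word):
    """Derive the scenario from the wake-up word, then look up its CPU state."""
    scenario = SCENARIO_OF_WAKE_WORD[wake_word]
    return STATE_OF_SCENARIO[scenario]
```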
  • As can be seen from the above, in this embodiment the audio data can be subjected to fuzzy speech recognition by the DSP; when the fuzzy speech recognition result indicates that the wake-up word exists, the DSP wakes up the CPU in the sleep state, and the CPU can then semantically analyze the audio data.
  • Because the scheme uses a DSP with lower running power consumption, instead of the higher-power CPU, to monitor the audio data, the CPU does not need to stay awake all the time but can remain dormant and be woken when needed.
  • The solution can therefore greatly reduce system power consumption while preserving mobility and the voice wake-up function, thus extending the standby time of the mobile terminal and improving its performance.
  • the voice recognition device is specifically integrated into the mobile terminal as an example for description.
  • As shown in FIG. 2a, the specific process of the voice recognition method can be as follows:
  • the mobile terminal collects the audio data by using a MIC.
  • the MIC may be independent of the mobile terminal or may be built in the mobile terminal.
  • The audio data may include data converted from various forms of sound; the type of sound is not limited and may be, for example, a human voice, an animal sound, the sound of an object, and/or music.
  • The mobile terminal performs fuzzy speech recognition on the audio data through the DSP; if the fuzzy speech recognition result indicates that a wake-up word exists, step 203 is performed. Otherwise, if the fuzzy speech recognition result indicates that there is no wake-up word, the process returns to step 201.
  • There may be one or more wake-up words, and they may be set in advance according to actual application requirements, for example, "calling", "sending information", "who is *", and/or "what is *", where "*" can be any noun, such as "who is Zhang San", "who is Li Si", or "what is Java", and so on.
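The "*"-style wake-word templates above can be matched with regular expressions, turning each "*" into a capture slot for the noun. Treating "*" as a single greedy slot is an assumption for illustration.

```python
import re


def template_to_regex(template):
    """Turn a wake-word template such as 'who is *' into a compiled regex."""
    return re.compile("^" + re.escape(template).replace(r"\*", "(.+)") + "$")


def match_wake_word(text, templates):
    """Return (matched template, captured noun) or (None, None)."""
    for t in templates:
        m = template_to_regex(t).match(text)
        if m:
            return t, (m.group(1) if m.groups() else None)
    return None, None
```

For example, "who is Zhang San" matches the template "who is *" and captures "Zhang San" as the noun.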
  • The DSP may be disposed in a codec of the mobile terminal (i.e., the Codec), for example, as shown in FIG. 1a.
  • The codec can compress and decompress (i.e., encode and decode) the audio data; when the MIC collects the audio data, the audio data is transmitted to the codec for processing, such as compression and/or decompression, and is then transmitted to the DSP for fuzzy speech recognition.
  • There may be multiple ways of performing fuzzy speech recognition. For example, fuzzy cluster analysis may be used to perform speech recognition on the audio data, or a fuzzy matching algorithm may be used, and so on. Specifically, the process can be as follows:
  • (1) The mobile terminal uses fuzzy clustering analysis to perform speech recognition on the audio data through the DSP and obtains the fuzzy speech recognition result.
  • For example, the DSP may specifically establish a fuzzy clustering neural network according to fuzzy clustering analysis, and then use the fuzzy clustering neural network as an estimator of the probability density function to predict the probability that the audio data includes the wake-up word. If the prediction result indicates that the probability is greater than or equal to a set value, a fuzzy speech recognition result indicating that the wake-up word exists is generated; otherwise, if the predicted probability is less than the set value, a fuzzy speech recognition result indicating that the wake-up word does not exist is generated.
  • the set value can be set according to the requirements of the actual application, and details are not described herein again.
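As a rough illustration of this idea, the sketch below replaces the fuzzy clustering neural network with a simple fuzzy-c-means-style membership computation: the membership of each audio feature vector in an assumed "wake-word" cluster is treated as the predicted probability and compared with the set value. The centroids, feature vectors, and threshold are all assumptions for illustration, not values from the patent.

```python
import numpy as np

def fuzzy_membership(x, centroids, m=2.0):
    """Fuzzy-c-means-style membership of feature vector x in each cluster."""
    d = np.linalg.norm(centroids - x, axis=1) + 1e-9  # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

def predict_wake_word(features, wake_centroid, other_centroids, threshold=0.5):
    """Treat membership in the wake-word cluster as the predicted probability;
    report a wake word iff the mean probability reaches the set threshold."""
    centroids = np.vstack([wake_centroid, other_centroids])
    probs = [fuzzy_membership(f, centroids)[0] for f in features]
    p = float(np.mean(probs))
    return p >= threshold, p
```

Feature vectors close to the wake-word centroid yield a membership near 1, so the probability exceeds the set value and a "wake-up word exists" result is produced.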
  • (2) The mobile terminal uses a fuzzy matching algorithm to perform speech recognition on the audio data through the DSP and obtains the fuzzy speech recognition result.
  • For example, the DSP may obtain a feature map of the wake-up word's pronunciation as a standard feature map, analyze the feature map of each word's pronunciation in the audio data to obtain feature maps to be matched, and then calculate, according to a preset membership function, the degree value with which each to-be-matched feature map belongs to the standard feature map. If the degree value is greater than or equal to a preset value, a fuzzy speech recognition result indicating that the wake-up word exists is generated; otherwise, if the degree value is less than the preset value, a fuzzy speech recognition result indicating that no wake-up word exists is generated.
  • the membership function and the preset value may be set according to the requirements of the actual application.
  • In addition, the degree to which a to-be-matched feature map belongs to the standard feature map may also be represented by a membership degree: the closer the membership degree is to 1, the higher the degree to which the to-be-matched feature map belongs to the standard feature map; the closer the membership degree is to 0, the lower that degree. Details are not described herein again.
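A minimal sketch of this membership-function approach, assuming feature maps can be represented as numeric vectors and using a Gaussian membership function (the actual membership function, feature representation, and preset value are application-specific assumptions):

```python
import numpy as np

def membership_degree(candidate, standard, sigma=1.0):
    """Gaussian membership function: degree in (0, 1], equal to 1 when the
    candidate feature map equals the standard one (sigma is an assumed width)."""
    d2 = float(np.sum((np.asarray(candidate) - np.asarray(standard)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def contains_wake_word(word_feature_maps, standard_map, preset=0.8):
    """Report a wake word if any per-word feature map reaches the preset degree."""
    return any(membership_degree(fm, standard_map) >= preset
               for fm in word_feature_maps)
```

A per-word feature map identical to the standard map yields a degree of 1, well above the preset value, while distant maps yield a degree near 0.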
  • In addition, in order to improve the recognition accuracy, before the fuzzy speech recognition, the audio data may be subjected to filtering processing such as noise reduction and/or echo cancellation; that is, as shown in FIG. 2b, the speech recognition method may further include: the mobile terminal performing noise reduction and/or echo cancellation processing on the audio data to obtain processed audio data.
  • In this case, the step "the mobile terminal performs fuzzy speech recognition on the audio data through the DSP" may specifically be: the mobile terminal performs fuzzy speech recognition on the processed audio data through the DSP.
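The filtering step can be sketched crudely as follows. A real implementation would use proper noise-reduction and echo-cancellation filters in the codec or DSP; this amplitude gate is only a hypothetical stand-in for that preprocessing.

```python
import numpy as np

def noise_gate(samples, threshold=0.02):
    """Crude noise reduction: zero out samples below an assumed amplitude
    threshold (a stand-in for real noise-reduction/echo-cancellation filters)."""
    x = np.asarray(samples, dtype=float)
    return np.where(np.abs(x) < threshold, 0.0, x)
```

The gated signal would then be handed to the DSP for fuzzy speech recognition in place of the raw samples.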
  • 203. The DSP wakes up the CPU in the sleep state.
  • For example, the DSP may activate the running programs of the CPU, such as the running programs related to recording and audio data processing in the CPU; in this way, the CPU can be woken up by the DSP, and so on.
  • 204. The mobile terminal reads the data of the wake-up word in the audio data through the DSP and obtains wake-up data.
  • For example, if the wake-up word is located in segment A of the audio data, the mobile terminal can read the segment-A data and use it as the wake-up data; if the wake-up word is located in segment B, the mobile terminal can read the segment-B data and use it as the wake-up data, and so on.
  • 205. The mobile terminal performs voice recognition on the wake-up data by using the CPU.
  • If the voice recognition result indicates that the wake-up word exists, step 206 is performed; otherwise, when the voice recognition result indicates that there is no wake-up word, the CPU is set to sleep and the process returns to the step of acquiring audio data (i.e., step 201).
  • At this time, the DSP can specifically be notified to continue performing the operation of speech recognition on the audio data; see FIG. 2b.
  • In order to reduce power consumption, when performing voice recognition on the wake-up data, the CPU may not open all of its cores but instead use a single core at a low frequency for the processing; that is, the step "performing voice recognition on the wake-up data by using the CPU" may include: setting the working state of the CPU to a first state, that is, a single core at a low frequency, so that the CPU performs voice recognition on the wake-up data in the first state.
  • Steps 204 and 205 are optional steps.
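The two-stage flow around steps 204 and 205 (verify the wake-up word in a low-power first state, then switch to a higher-power state only on success) can be sketched as below. The core counts and frequencies are assumed values for illustration, not from the patent, and `set_state` stands in for whatever platform mechanism actually reconfigures the CPU.

```python
from dataclasses import dataclass

@dataclass
class CpuState:
    cores: int
    freq_mhz: int

# Assumed values for illustration only; the real numbers are platform-specific.
FIRST_STATE = CpuState(cores=1, freq_mhz=300)     # single core, low frequency
SECOND_STATE = CpuState(cores=4, freq_mhz=1800)   # multi-core, high frequency

def verify_then_analyze(wake_data, audio_data, recognize, analyze, set_state):
    """Two-stage flow: verify the wake word in the low-power first state;
    only on success switch to the high-power second state for semantic analysis."""
    set_state(FIRST_STATE)
    if not recognize(wake_data):
        return None  # the CPU goes back to sleep and the DSP keeps listening
    set_state(SECOND_STATE)
    return analyze(audio_data)
```

With a failing recognizer, only the first state is ever entered, which mirrors the power-saving behavior described above.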
  • 206. The mobile terminal performs semantic analysis on the audio data by using the CPU.
  • For example, the working state of the CPU may be set to a second state, that is, multi-core and high frequency, and the audio data is semantically analyzed by the CPU in the second state.
  • Optionally, in order to better balance power consumption and processing efficiency, the working core number and primary frequency of the CPU can also be adjusted according to the specific voice scenario. For example, the mobile terminal can determine a semantic scenario according to the wake-up word corresponding to the audio data, then determine the working core number and primary frequency of the CPU according to the semantic scenario, set the working state of the CPU according to that working core number and primary frequency (i.e., a third state), and perform semantic analysis on the audio data in this working state.
  • For example, in the semantic scenario corresponding to "calling", the working core number of the CPU needs to be a single core and the primary frequency X MHz; in the semantic scenario corresponding to "sending information", the working core number needs to be a single core and the primary frequency Y MHz; and in the semantic scenario corresponding to "search", the working core number needs to be dual-core and the primary frequency Z MHz. The specifics can be as follows:
  • If the semantic scenario determined from the wake-up word is "calling", the working core number of the CPU can be set to a single core and the primary frequency to X MHz; then, in this working state, the audio data is semantically analyzed by the CPU.
  • If the semantic scenario determined from the wake-up word is "sending information", the working core number of the CPU can be set to a single core and the primary frequency to Y MHz; then, in this working state, the audio data is semantically analyzed by the CPU.
  • If the semantic scenario determined from the wake-up word is "search", the working core number of the CPU can be set to dual-core and the primary frequency to Z MHz; then, in this working state, the audio data is semantically analyzed by the CPU.
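The scenario-to-working-state mapping described above can be sketched as a small lookup table. Since the X, Y, and Z frequencies are unspecified in the text, the numbers below are purely illustrative assumptions.

```python
# Assumed scenario table; X/Y/Z are placeholders in the text, so the
# (cores, freq_mhz) values here are illustrative only.
SCENARIO_STATES = {
    "calling": (1, 400),              # single core, X MHz
    "sending information": (1, 600),  # single core, Y MHz
    "search": (2, 1200),              # dual core, Z MHz
}

def third_state_for(wake_word):
    """Map the wake word's semantic scenario to a (cores, freq_mhz) third state."""
    return SCENARIO_STATES.get(wake_word, (1, 400))  # assumed default state
```

The returned pair would then be applied as the CPU's third state before semantic analysis begins.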
  • Optionally, after the CPU is woken up, the mobile terminal can continue to collect other audio data through the MIC, perform semantic analysis on it with the awake CPU, and perform corresponding operations according to the analysis result; for the manner of the semantic analysis and of performing the corresponding operations, refer to steps 206 and 207, and details are not described herein again.
  • 207. The mobile terminal performs a corresponding operation according to the analysis result.
  • the operation object and the operation content may be determined according to the analysis result, and then the operation content is performed on the operation object by the CPU, and the like.
  • For example, if the analysis result indicates that the user wants to "call Zhang San", the mobile terminal can determine that the operation object is "the telephone number of Zhang San in the address book" and the operation content is "calling the telephone number"; the CPU can then dial Zhang San's number in the address book, thereby completing the task of "calling Zhang San".
  • For another example, if the analysis result indicates that the user wants to "search for poetry", the mobile terminal can determine that the operation object is the "search engine application" and the operation content is "searching for the keyword 'poetry' through the search engine application"; the mobile terminal can then activate the search engine application and search for the keyword "poetry" through it, completing the task of "searching for poetry", and so on.
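The object/content dispatch in step 207 might look like the following sketch, where the handlers and the analysis-result format are hypothetical stand-ins for real terminal actions:

```python
def execute(analysis):
    """Dispatch on the (object, content) pair produced by semantic analysis.
    The keys, objects, and return strings are illustrative assumptions."""
    obj, content = analysis["object"], analysis["content"]
    if obj == "contacts" and content.startswith("call"):
        return f"dialing {content.split('call ', 1)[1]}"
    if obj == "search engine" and content.startswith("search"):
        return f"searching for {content.split('search ', 1)[1]}"
    return "no-op"  # unrecognized analysis results are ignored
```

For instance, an analysis result pairing the contacts object with a "call Zhang San" content would dispatch to the dialing branch.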
  • As can be seen from the above, after acquiring the audio data, the mobile terminal of this embodiment can perform fuzzy speech recognition on the audio data through the DSP; only when a wake-up word is determined to exist does the DSP wake the CPU from the sleep state, and the CPU, working in a single-core, low-frequency state, confirms again whether the wake-up word exists. If the CPU determines that there is no wake-up word, the CPU switches back to the sleep state and the DSP continues to listen; the audio data is semantically analyzed by the CPU only when the CPU confirms that the wake-up word exists. Because the scheme uses a DSP with lower running power consumption, instead of the higher-power CPU, to monitor the audio data, the CPU does not need to be awake all the time but can remain dormant and be woken when needed. Therefore, compared with existing solutions that can only be supplied by an external power source or woken through a physical button, this solution greatly reduces system power consumption while retaining mobility and the voice wake-up function, thereby prolonging the standby time of the mobile terminal and improving its performance.
  • Moreover, since in this scheme the wake-up words recognized by the DSP are recognized again by the CPU, the recognition accuracy is high; and because the CPU uses a low-power working state (such as single-core and low-frequency) when re-recognizing the wake-up words, and switches to a higher-power working state for semantic analysis only when the wake-up word is confirmed, resources are utilized more reasonably and effectively, which is beneficial to further improving the performance of the mobile terminal.
  • In order to better implement the above method, an embodiment of the present invention further provides a voice recognition device, which may be integrated in a mobile terminal such as a mobile phone, a wearable smart device, a tablet computer, and/or a laptop computer.
  • For example, as shown in FIG. 3a, the voice recognition apparatus may include an acquisition unit 301, a fuzzy identification unit 302, and a wake-up unit 303, as follows:
  • the obtaining unit 301 is configured to acquire audio data.
  • For example, the obtaining unit 301 can be specifically configured to collect the audio data by using a MIC, such as a MIC module built into the mobile terminal.
  • the fuzzy identification unit 302 is configured to perform fuzzy speech recognition on the audio data by using a DSP.
  • There may be multiple ways of performing fuzzy speech recognition; for example, fuzzy cluster analysis may be used to perform speech recognition on the audio data, or a fuzzy matching algorithm may be used, that is:
  • the fuzzy identification unit 302 can be specifically configured to perform voice recognition on the audio data by using a fuzzy cluster analysis to obtain a fuzzy speech recognition result.
  • For example, the fuzzy identification unit 302 may be specifically configured to establish a fuzzy clustering neural network according to fuzzy clustering analysis, use the fuzzy clustering neural network as an estimator of the probability density function to predict the probability that the audio data includes the wake-up word, generate a fuzzy speech recognition result indicating that the wake-up word exists if the predicted probability is greater than or equal to a set value, and generate a fuzzy speech recognition result indicating that the wake-up word does not exist if the predicted probability is less than the set value.
  • the set value can be set according to the requirements of the actual application, and details are not described herein again.
  • the fuzzy identification unit 302 is specifically configured to perform voice recognition on the audio data by using a fuzzy matching algorithm, and obtain a fuzzy speech recognition result.
  • For example, the fuzzy identification unit 302 may be specifically configured to obtain a feature map of the wake-up word's pronunciation as a standard feature map, analyze the feature map of each word's pronunciation in the audio data to obtain feature maps to be matched, and calculate, according to a preset membership function, the degree value with which each to-be-matched feature map belongs to the standard feature map; if the degree value is greater than or equal to a preset value, a fuzzy speech recognition result indicating that the wake-up word exists is generated; if the degree value is less than the preset value, a fuzzy speech recognition result indicating that the wake-up word does not exist is generated.
  • the membership function and the preset value may be set according to the requirements of the actual application, and are not described herein again.
  • the speech recognition apparatus can further include a processing unit 304, as shown in FIG. 3b:
  • the processing unit 304 is configured to perform semantic analysis on the audio data by using a CPU, and perform a corresponding operation according to the analysis result.
  • the processing unit 304 may be specifically configured to perform semantic analysis on the audio data by using a CPU, and determine an operation object and an operation content according to the analysis result, and then execute the operation content on the operation object, and the like.
  • Optionally, in order to improve the recognition accuracy, before the fuzzy speech recognition the audio data may be subjected to filtering processing such as noise reduction and/or echo cancellation; that is, as shown in FIG. 3c, the voice recognition device may further include a filtering unit 305, as follows:
  • the filtering unit 305 can be configured to perform noise reduction and/or echo cancellation processing on the audio data.
  • the fuzzy identification unit 302 can be used to perform fuzzy speech recognition on the audio data processed by the filtering unit 305.
  • the waking unit 303 can be configured to wake up the CPU in the sleep state when the fuzzy speech recognition result indicates that the wake-up word exists.
  • the wake-up word may be one or more, and the wake-up word may be set in advance according to the requirements of the actual application, and details are not described herein again.
  • Optionally, in order to improve the recognition accuracy, the wake-up word may be further identified before the processing unit 304 performs semantic analysis on the audio data by the CPU; that is, as shown in FIG. 3c, the speech recognition device may also include a precise identification unit 306, as follows:
  • The precise identification unit 306 can be configured to read the data of the wake-up word in the audio data from the DSP to obtain wake-up data, and to perform voice recognition on the wake-up data by the CPU; when the voice recognition result indicates that the wake-up word exists, the processing unit 304 is triggered to perform the operation of semantic analysis on the audio data by the CPU; when the voice recognition result indicates that there is no wake-up word, the CPU is set to sleep and the acquisition unit is triggered to perform the operation of acquiring audio data.
  • Optionally, in order to save power, when the CPU is woken up, not all of the CPU's cores need to be turned on; instead, a single core at a low frequency can be used for the processing, that is:
  • the precise identification unit 306 can be specifically configured to set the working state of the CPU to a first state, in which the wake-up data is voice-recognized, wherein the first state is a single core and a low frequency.
  • Subsequently, when semantic analysis is required, the number of working cores may be increased and the frequency raised to perform semantic analysis on the audio data, namely:
  • the processing unit 304 is specifically configured to set an operating state of the CPU to a second state, where the audio data is semantically analyzed, wherein the second state is a multi-core and a high frequency.
  • the processing unit 304 is specifically configured to determine a semantic scenario according to the wake-up word corresponding to the audio data, determine a working core number and a primary frequency of the CPU according to the semantic scenario, and perform an operating state of the CPU according to the working core number and the primary frequency. Setting, a third state is obtained, in which the audio data is semantically analyzed.
  • In a specific implementation, the foregoing units may each be implemented as a separate entity, or may be combined arbitrarily and implemented as the same entity or several entities; for the specific implementation of the foregoing units, reference may be made to the foregoing method embodiments, and details are not described herein again.
  • As can be seen from the above, in the voice recognition device of this embodiment, after the acquisition unit 301 acquires the audio data, the fuzzy identification unit 302 can perform fuzzy voice recognition on the audio data, and only when a wake-up word is determined to exist does the wake-up unit 303 wake the CPU from the sleep state; the CPU can then be used for semantic analysis of the audio data. Because the scheme uses a DSP with lower running power consumption, instead of the higher-power CPU, to monitor the audio data, the CPU does not need to be awake all the time but can remain dormant and be woken when needed; the solution can therefore greatly reduce system power consumption while preserving mobility and the voice wake-up function, thus extending the standby time of the mobile terminal and improving its performance.
  • the embodiment of the present invention further provides a mobile terminal.
  • The mobile terminal may include a radio frequency (RF) circuit 401, a memory 402 including one or more computer-readable storage media, an input unit 403, a display unit 404, a sensor 405, an audio circuit 406, a Wireless Fidelity (WiFi) module 407, a processor 408 including one or more processing cores, a power source 409, and the like.
  • It can be understood that the mobile terminal structure shown in FIG. 4 does not constitute a limitation on the mobile terminal, and the mobile terminal may include more or fewer components than those illustrated, a combination of certain components, or a different component arrangement, where:
  • The RF circuit 401 can be used for receiving and transmitting signals during the sending and receiving of information or during a call. In particular, after receiving downlink information from a base station, the RF circuit 401 hands it over to one or more processors 408 for processing; in addition, it sends uplink data to the base station.
  • Generally, the RF circuit 401 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 401 can also communicate with networks and other devices through wireless communication.
  • The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
  • the memory 402 can be used to store software programs and modules, and the processor 408 executes various functional applications and data processing by running software programs and modules stored in the memory 402.
  • The memory 402 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; and the storage data area may store data created according to the use of the mobile terminal (such as audio data, a phone book, etc.).
  • memory 402 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 402 may also include a memory controller to provide access to memory 402 by processor 408 and input unit 403.
  • Input unit 403 can be used to receive input numeric or character information, as well as to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function controls.
  • input unit 403 can include a touch-sensitive surface as well as other input devices.
  • A touch-sensitive surface, also known as a touch screen or a touch pad, collects touch operations performed by the user on or near it (such as operations performed by the user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connecting device according to a preset program.
  • the touch sensitive surface may include two parts of a touch detection device and a touch controller.
  • The touch detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends it to the processor 408, and can receive commands from the processor 408 and execute them.
  • touch-sensitive surfaces can be implemented in a variety of types, including resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 403 can also include other input devices. Specifically, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
  • Display unit 404 can be used to display information entered by the user or information provided to the user as well as various graphical user interfaces of the mobile terminal, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display unit 404 can include a display panel.
  • the display panel can be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
  • The touch-sensitive surface can cover the display panel; when the touch-sensitive surface detects a touch operation on or near it, the operation is transmitted to the processor 408 to determine the type of the touch event, and the processor 408 then provides a corresponding visual output on the display panel according to the type of the touch event.
  • Although in FIG. 4 the touch-sensitive surface and the display panel are implemented as two separate components to perform the input and output functions, in some embodiments the touch-sensitive surface can be integrated with the display panel to implement the input and output functions.
  • the mobile terminal may also include at least one type of sensor 405, such as a light sensor, motion sensor, and other sensors.
  • The light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel according to the brightness of the ambient light, and the proximity sensor may turn off the display panel and/or the backlight when the mobile terminal moves to the ear.
  • As one type of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in all directions (usually three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as horizontal/vertical screen switching, related games, and magnetometer attitude calibration), vibration-recognition related functions (such as a pedometer and tapping), and the like. As for the gyroscope, barometer, hygrometer, thermometer, infrared sensor, and other sensors that can also be configured in the mobile terminal, details are not described herein again.
  • the audio circuit 406, the speaker, and the microphone provide an audio interface between the user and the mobile terminal.
  • The audio circuit 406 can transmit the electrical signal converted from the received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 406 and converted into audio data. After the audio data is processed by the audio data output processor 408, it is transmitted via the RF circuit 401 to, for example, another mobile terminal, or the audio data is output to the memory 402 for further processing.
  • the audio circuit 406 may also include an earbud jack to provide communication between the peripheral earphone and the mobile terminal.
  • WiFi is a short-range wireless transmission technology.
  • the mobile terminal can help users to send and receive emails, browse web pages, and access streaming media through the WiFi module 407, which provides wireless broadband Internet access for users.
  • Although FIG. 4 shows the WiFi module 407, it can be understood that the module is not an essential part of the mobile terminal and may be omitted as needed without changing the essence of the invention.
  • the processor 408 is the control center of the mobile terminal, connecting various portions of the entire handset using various interfaces and lines, by running or executing software programs and/or modules stored in the memory 402, and recalling data stored in the memory 402, Perform various functions of the mobile terminal and process data to monitor the mobile phone as a whole.
  • Optionally, the processor 408 may include one or more processing cores; preferably, the processor 408 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 408.
  • the mobile terminal also includes a power source 409 (such as a battery) for powering various components.
  • the power source can be logically coupled to the processor 408 through a power management system to manage functions such as charging, discharging, and power management through the power management system.
  • the power supply 409 may also include any one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
  • the mobile terminal may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
  • Specifically, in this embodiment, the processor 408 in the mobile terminal loads the executable files corresponding to the processes of one or more applications into the memory 402 according to the following instructions, and the processor 408 runs the applications stored in the memory 402, thereby implementing various functions:
  • acquiring audio data, and performing fuzzy speech recognition on the audio data through the DSP; when the fuzzy speech recognition result indicates that a wake-up word exists, waking up, by the DSP, the CPU in the sleep state, the CPU performing semantic analysis on the audio data.
  • After the CPU is woken up, the CPU can perform semantic analysis on the audio data and perform corresponding operations according to the analysis result.
  • When fuzzy speech recognition is performed on the audio data, fuzzy clustering analysis or a fuzzy matching algorithm may be used to perform voice recognition on the audio data, and so on.
  • Optionally, in order to improve the recognition accuracy, the audio data may be subjected to filtering processing such as noise reduction and/or echo cancellation before the fuzzy speech recognition; that is, the processor 408 may also run the application stored in the memory 402 to implement the following function:
  • the audio data is subjected to noise reduction and/or echo cancellation processing to obtain processed audio data.
  • Optionally, in order to improve the recognition accuracy, the wake-up word in the audio data may be further identified by the CPU before the semantic analysis of the audio data; that is, the processor 408 may also run the application stored in the memory 402 to implement the following functions: reading the data of the wake-up word in the audio data from the DSP to obtain wake-up data, and performing voice recognition on the wake-up data by the CPU; when the voice recognition result indicates that the wake-up word exists, performing the operation of semantic analysis on the audio data by the CPU; otherwise, when the voice recognition result indicates that there is no wake-up word, setting the CPU to sleep and returning to the operation of acquiring audio data.
  • As can be seen from the above, after acquiring the audio data, the mobile terminal of this embodiment can perform fuzzy speech recognition on the audio data through the DSP; only when a wake-up word is determined to exist does the DSP wake the CPU from the sleep state, and the CPU can then be used for semantic analysis of the audio data. Because the scheme uses a DSP with lower running power consumption, instead of the higher-power CPU, to monitor the audio data, the CPU does not need to be awake all the time but can remain dormant and be woken when needed; the solution can therefore greatly reduce system power consumption while preserving mobility and the voice wake-up function, thus extending the standby time of the mobile terminal and improving its performance.
  • an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored, which can be loaded by a processor to perform the steps in any of the voice recognition methods provided by the embodiments of the present invention.
  • the instruction can perform the following steps:
  • acquiring audio data, and performing fuzzy speech recognition on the audio data through the DSP; when the fuzzy speech recognition result indicates that a wake-up word exists, waking up, by the DSP, the CPU in the sleep state, the CPU performing semantic analysis on the audio data.
  • After the CPU is woken up, the CPU can perform semantic analysis on the audio data and perform corresponding operations according to the analysis result.
  • When fuzzy speech recognition is performed on the audio data, fuzzy clustering analysis or a fuzzy matching algorithm may be used to perform voice recognition on the audio data, and so on.
  • Optionally, in order to improve the recognition accuracy, the audio data may be subjected to filtering processing such as noise reduction and/or echo cancellation before the fuzzy speech recognition; that is, the instructions may also perform the following step:
  • the audio data is subjected to noise reduction and/or echo cancellation processing to obtain processed audio data.
  • Optionally, in order to improve the recognition accuracy, the wake-up word in the audio data may be further identified by the CPU before the semantic analysis of the audio data; that is, the instructions may further perform the following steps:
  • reading the data of the wake-up word in the audio data from the DSP to obtain wake-up data, and performing voice recognition on the wake-up data by the CPU; when the voice recognition result indicates that the wake-up word exists, performing the operation of semantic analysis on the audio data by the CPU; otherwise, when the voice recognition result indicates that there is no wake-up word, setting the CPU to sleep and returning to the operation of acquiring audio data.
  • The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.


Abstract

本发明实施例公开了一种语音识别方法、装置和存储介质;本实施例在获取到音频数据后,可以通过DSP对该音频数据进行模糊语音识别,当确定存在唤醒词时,才由该DSP唤醒处于休眠状态的CPU,该CPU用于对该音频数据进行语义分析。该方案采用了运行功耗较低的DSP,代替运行功耗较高的CPU来对音频数据进行监听,因此,CPU无需一直处于被唤醒状态,而是可以处于休眠状态,并在需要时才被唤醒,可以在保留移动性和语音唤醒功能的前提下,大大减少系统功耗,从而延长移动终端的待机时间,改善移动终端的性能。

Description

语音识别方法、装置和存储介质
本申请要求于2017年7月19日提交中国专利局、申请号201710588382.8、申请名称为“语音识别方法、装置和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及通信技术领域,具体涉及语音识别。
背景技术
随着人工智能的发展,智能硬件产品也得到快速发展。所谓智能硬件产品,指的是集成了人工智能功能的硬件设备,比如智能移动终端(简称移动终端)等。智能硬件产品的核心必然离不开与“人”的互动,而语音交互作为自然、且学习成本低的交互方式已成为智能硬件产品的主流技术。
在语音交互中,如何进行语音唤醒是一个重要的问题。以移动终端为例,在现有技术中,为了实现语音快速唤醒,一般都会要求终端的录音功能一直处于开启状态,且中央处理器(CPU,Central Processing Unit)能够随时对音频数据进行处理,即便在用户未说话时,CPU也不能休眠。由于CPU需要对各种音频数据进行编解码、播放、以及实现其他各种功能,因此,该方案对CPU的规格要求较高,而且,整个系统的功耗也非常大,对于使用电池供电的移动终端而言,会大大缩短其待机时间。为此,现有技术又提出了采用外接电源来进行供电,或采用一个物理按键来进行唤醒的方案,但是,若采用外接电源,则势必会影响其移动性,而若通过物理按键来唤醒,则无法实现语音唤醒;也就是说,在现有方案中,若需要保持其移动性和语音唤醒功能,则必然需要消耗大量的电池电量,这将大大减少移动终端的待机时间,影响移动终端的性能。
发明内容
本发明实施例提供一种语音识别方法、装置和存储介质;可以减少系统功耗,使得在保持移动性和语音唤醒功能的前提下,延长移动终端的待机时间,改善移动终端的性能。
第一方面,本发明实施例提供一种语音识别方法,包括:
获取音频数据;
通过数字信号处理器(DSP,Digital Signal Processing)对所述音频数据进行模糊语音识别;
当模糊语音识别结果指示存在唤醒词时,由DSP唤醒处于休眠状态的CPU,所述CPU用于对所述音频数据进行语义分析。
在一部分实施例中,所述通过数字信号处理器对所述音频数据进行模糊语音识别,包括:
通过数字信号处理器,采用模糊聚类分析对所述音频数据进行语音识别,得到模糊语音识别结果。
在一部分实施例中,所述通过数字信号处理器,采用模糊聚类分析对所述音频数据进行语音识别,得到模糊语音识别结果,包括:
根据模糊聚类分析建立模糊聚类神经网络;
将所述模糊聚类神经网络作为概率密度函数的估计器,对所述音频数据包含唤醒词的概率进行预测;
若预测结果指示概率大于等于设定值,则生成指示存在唤醒词的模糊语音识别结果;
若预测结果指示概率小于所述设定值,则生成指示不存在唤醒词的模糊语音识别结果。
在一部分实施例中,所述通过数字信号处理器对所述音频数据进行模糊语音识别,包括:
通过数字信号处理器,采用模糊匹配算法对所述音频数据进行语音识别,得到模糊语音识别结果。
在一部分实施例中,所述通过数字信号处理器,采用模糊匹配算法对所述音频数据进行语音识别,得到模糊语音识别结果,包括:
获取唤醒词读音的特征图,得到标准特征图;
分析所述音频数据中各个单词读音的特征图,得到待匹配特征图;
根据预设的隶属度函数计算各个待匹配特征图属于标准特征图的程度值;
若所述程度值大于等于预设值,则生成指示存在唤醒词的模糊语音识别结果;
若所述程度值小于所述预设值,则生成指示不存在唤醒词的模糊语音识别结果。
在一部分实施例中,在所述由所述数字信号处理器唤醒处于休眠状态的中央处理器之后,还包括:
通过所述中央处理器对所述音频数据进行语义分析,并根据分析结果执行所述分析结果相应的操作。
在一部分实施例中,所述通过所述中央处理器对所述音频数据进行语义分析之前,还包括:
从所述数字信号处理器中读取所述音频数据中包含唤醒词的数据,得到唤醒数据;
通过所述中央处理器对所述唤醒数据进行语音识别;
当语音识别结果指示存在唤醒词时,执行通过所述中央处理器对所述音频数据进行语义分析的步骤;
当语音识别结果指示不存在唤醒词时,将所述中央处理器设置为休眠,并返回执行获取音频数据的步骤。
在一部分实施例中,所述通过所述中央处理器对所述唤醒数据进行语音识别,包括:
将所述中央处理器的工作状态设置为第一状态,所述第一状态为单核且低频;
在所述第一状态下,对所述唤醒数据进行语音识别。
在一部分实施例中,所述通过所述中央处理器对所述音频数据进行语义分析,包括:
将所述中央处理器的工作状态设置为第二状态,所述第二状态为多核且高频;
在所述第二状态下,对所述音频数据进行语义分析。
在一部分实施例中,所述通过所述中央处理器对所述音频数据进行语义分析,包括:
根据所述音频数据对应的唤醒词确定语义场景;
根据语义场景确定所述中央处理器的工作核数和主频大小;
根据所述工作核数和主频大小对所述中央处理器的工作状态进行设置,得到第三状态;
在所述第三状态下,对所述音频数据进行语义分析。
在一部分实施例中,所述通过数字信号处理器对所述音频数据进行模糊语音识别之前,还包括:
对所述音频数据进行降噪和/或回音消除处理。
在一部分实施例中,所述根据分析结果执行相应操作,包括:
根据所述分析结果确定操作对象和操作内容;
对所述操作对象执行所述操作内容。
第二方面,本发明实施例提供一种语音识别装置,包括:
获取单元,用于获取音频数据;
模糊识别单元,用于通过DSP对所述音频数据进行模糊语音识别;
唤醒单元,用于当模糊语音识别结果指示存在唤醒词时,唤醒处于休眠状态的CPU,所述CPU用于对所述音频数据进行语义分析。
在一部分实施例中,所述模糊识别单元,具体用于通过DSP,采用模糊聚类分析对所述音频数据进行语音识别,得到模糊语音识别结果。
例如,所述模糊识别单元,具体可以用于:根据模糊聚类分析建立模糊聚类神经网络;将所述模糊聚类神经网络作为概率密度函数的估计器,对所述音频数据包含唤醒词的概率进行预测;若预测结果指示概率大于等于设定值,则生成指示存在唤醒词的模糊语音识别结果;若预测结果指示概率小于设定值,则生成指示不存在唤醒词的模糊语音识别结果。
在一部分实施例中,所述模糊识别单元,具体用于通过DSP,采用模糊匹配算法对所述音频数据进行语音识别,得到模糊语音识别结果。
例如,所述模糊识别单元,具体可以用于获取唤醒词读音的特征图,得到标准特征图;分析所述音频数据中各个单词读音的特征图,得到待匹配特征图;根据预设的隶属度函数计算各个待匹配特征图属于标准特征图的程度值;若所述程度值大于等于预设值,则生成指示存在唤醒词的模糊语音识别结果;若所述程度值小于预设值,则生成指示不存在唤醒词的模糊语音识别结果。
在一部分实施例中,所述语音识别装置还可以包括处理单元,所述处理单元用于通过CPU对所述音频数据进行语义分析,并根据分析结果执行相应操作。
在一部分实施例中,所述语音识别装置还可以包括精确识别单元,如下:
所述精确识别单元,用于从DSP中读取所述音频数据中包含唤醒词的数据,得到唤醒数据;通过所述CPU对所述唤醒数据进行语音识别;当语音识别结果指示存在唤醒词时,触发处理单元执行通过CPU对所述音频数据进行语义分析的操作;当语音识别结果指示不存在唤醒词时,将CPU设置为休眠,并触发获取单元执行获取音频数据的操作。
其中,所述精确识别单元,具体可以用于将所述CPU的工作状态设置为第一状态,所述第一状态为单核且低频,在所述第一状态下,对所述唤醒数据进行语音识别。
在一部分实施例中,所述处理单元,具体可以用于将所述CPU的工作状态设置为第二状态,所述第二状态为多核且高频,在所述第二状态下,对所述音频数据进行语义分析。
在一部分实施例中,所述处理单元,具体可以用于根据所述音频数据对应的唤醒词确定语义场景,根据语义场景确定CPU的工作核数和主频大小,根据所述工作核数和主频大小对CPU的工作状态进行设置,得到第三状态,在所述第三状态下,对所述音频数据进行语义分析。
在一部分实施例中,所述语音识别装置还可以包括过滤单元,如下:
所述过滤单元,用于对所述音频数据进行降噪和/或回音消除处理。
第三方面,本发明实施例还提供一种移动终端,所述移动终端包括存储介质和处理器,所述存储介质存储有多条指令,所述处理器用于加载并执行所述指令,所述指令用于实现本发明实施例所提供的任一种语音识别方法中的步骤。
第四方面,本发明实施例还提供一种存储介质,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行本发明实施例所提供的任一种语音识别方法中的步骤。
本发明实施例在获取到音频数据后，可以通过DSP对该音频数据进行模糊语音识别，当确定存在唤醒词时，才由该DSP唤醒处于休眠状态的CPU，该CPU可以用于对该音频数据进行语义分析。由于该方案采用了运行功耗较低的DSP，代替运行功耗较高的CPU来对音频数据进行监听，因此，CPU无需一直处于被唤醒状态，而是可以处于休眠状态，并在需要时才被唤醒；所以，相对于现有方案只能通过外接电源或通过物理按键来唤醒的方案而言，该方案可以在保留移动性和语音唤醒功能的前提下，大大减少系统功耗，从而延长移动终端的待机时间，改善移动终端的性能。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1a是本发明实施例提供的移动终端的架构图;
图1b是本发明实施例提供的语音识别方法的场景示意图;
图1c是本发明实施例提供的语音识别方法的流程图;
图1d是本发明实施例提供的语音识别方法的框图;
图2a是本发明实施例提供的语音识别方法的另一流程图;
图2b是本发明实施例提供的语音识别方法的另一框图;
图3a是本发明实施例提供的语音识别装置的结构示意图;
图3b是本发明实施例提供的语音识别装置的另一结构示意图;
图3c是本发明实施例提供的语音识别装置的另一结构示意图;
图4是本发明实施例提供的移动终端的结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
本发明实施例提供一种语音识别方法、装置和存储介质。
该语音识别装置具体可以集成在移动终端,比如手机、穿戴式智能设备、平板电脑、和/或笔记本电脑等设备中。
例如,以该语音识别装置集成在移动终端中为例,参见图1a,可以在移动终端中设置一DSP,比如,可以将该DSP设置在编码解码器(Codec,Coder-decoder)中(如带有DSP功能的编码解码器),这样,当移动终端获取到音频数据,如通过麦克风(MIC,Microphone)接收到用户发出的声音后,便可以通过该DSP对该音频数据进行模糊语音识别,若模糊语音识别结果指示存在唤醒词,则由DSP唤醒处于休眠状态的CPU,该CPU可以用于对该音频数据进行语义分析,比如,参见图1b;否则,若模糊语音识别结果指示不存在唤醒词,则不唤醒CPU,而是由DSP继续对音频数据进行监听。
需说明的是,DSP是一种特别适合于进行数字信号处理运算的微处理器,它可以实时快速地实现各种数字信号处理算法,而且,由于其具有低开销或无开销循环及跳转的硬件支持的特性,所以,相对于其他处理器而言,其功耗也较低;此外,DSP还具有降噪的功能。
以下分别进行详细说明。需说明的是,以下实施例的序号不作为对实施例优选顺序的限定。
实施例一、
在本实施例中,将以语音识别装置的角度进行描述,该语音识别装置具体可以集成在移动终端等设备中,该移动终端可以包括手机、穿戴式智能设备、平板电脑、和/或笔记本电脑等设备。
本实施例提供一种语音识别方法,包括:获取音频数据,通过DSP对该音频数据进行模糊语音识别,当模糊语音识别结果指示存在唤醒词时,由DSP唤醒处于休眠状态的CPU,该CPU用于对该音频数据进行语义分析。
如图1c所示,该语音识别方法的具体流程可以如下:
101、获取音频数据。
例如,具体可以通过MIC,比如移动终端内置的MIC模块来采集该音频数据。
其中,该音频数据可以包括各种形式的声音所转换成的数据,该声音的类别可以不做限定,比如,可以是说话声、动物发出的声音、敲打物体的声音、和/或音乐,等等。
102、通过DSP对该音频数据进行模糊语音识别。
其中,模糊语音识别的方式可以有多种,比如,可以采用模糊聚类分析来对该音频数据进行语音识别,或者,也可以采用模糊匹配算法来对该音频数据进行语音识别,等等;即步骤“通过DSP对该音频数据进行模糊语音识别”具体可以如下:
(1)通过DSP,采用模糊聚类分析对该音频数据进行语音识别,得到模糊语音识别结果。
比如,具体可以根据模糊聚类分析建立模糊聚类神经网络,将该模糊聚类神经网络作为概率密度函数的估计器,对该音频数据包含唤醒词的概率进行预测,若预测结果指示概率大于等于设定值,则生成指示存在唤醒词的模糊语音识别结果,否则,若预测结果指示概率小于设定值,则生成指示不存在唤醒词的模糊语音识别结果。
其中,模糊聚类分析一般是指根据研究对象本身的属性来构造模糊矩阵,并在此基础上根据一定的隶属度来确定聚类关系,即用模糊数学的方法把样本之间的模糊关系定量的确定,从而客观且准确地进行聚类。聚类就是将数据集分成多个类或簇,使得各个类之间的数据差别应尽可能大,类内之间的数据差别应尽可能小。
其中,该设定值可以根据实际应用的需求进行设置,在此不再赘述。
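上述“以模糊聚类神经网络作为概率密度函数的估计器，按设定值判定唤醒词是否存在”的逻辑，可以用下面的 Python 草图示意（假设性示例：predict_wakeword_prob 为代替真实模型的桩函数，设定值 0.6 为假定数值，均非专利原文实现）：

```python
# 模糊语音识别结果判定的简化示意（假设性示例）。
# predict_wakeword_prob 代表“模糊聚类神经网络作为概率密度函数估计器”的输出，
# 这里用一个桩函数代替真实模型。

SET_VALUE = 0.6  # 设定值（阈值），可根据实际应用的需求进行设置

def fuzzy_recognize(audio_features, predict_wakeword_prob, set_value=SET_VALUE):
    """预测音频数据包含唤醒词的概率，并与设定值比较，生成模糊语音识别结果。"""
    prob = predict_wakeword_prob(audio_features)
    # 概率大于等于设定值 -> 生成“指示存在唤醒词”的识别结果
    return {"wakeword_present": prob >= set_value, "prob": prob}

# 用法示意：用桩函数模拟模型输出概率 0.8
result = fuzzy_recognize([0.1, 0.2], lambda feats: 0.8)
```

真实系统中，predict_wakeword_prob 应由根据模糊聚类分析建立的神经网络给出。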
(2)通过DSP,采用模糊匹配算法对该音频数据进行语音识别,得到模糊语音识别结果。
比如,具体可以获取唤醒词读音的特征图,得到标准特征图,以及分析该音频数据中各个单词读音的特征图,得到待匹配特征图,然后,根据预设的隶属度函数计算各个待匹配特征图属于标准特征图的程度值,若该程度值大于等于预设值,则生成指示存在唤醒词的模糊语音识别结果,否则,若该程度值小于预设值,则生成指示不存在唤醒词的模糊语音识别结果。
其中,该隶属度函数和预设值可以根据实际应用的需求进行设置,在此不再赘述。
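模糊匹配中“根据隶属度函数计算待匹配特征图属于标准特征图的程度值”这一步，可以用如下最小示意表达（假设性示例：隶属度函数取 1/(1+欧氏距离)，预设值 0.8 为假定数值，实际的隶属度函数与预设值均可按需求设置）：

```python
import math

def membership(candidate, standard):
    """隶属度示意：计算待匹配特征图属于标准特征图的程度值（0~1）。"""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(candidate, standard)))
    return 1.0 / (1.0 + dist)  # 距离越小，程度值越接近 1

def match_wakeword(word_features, standard_features, preset=0.8):
    """任一单词特征图的程度值大于等于预设值，即生成“存在唤醒词”的结果。"""
    return any(membership(f, standard_features) >= preset for f in word_features)
```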
可选的,为了提高语音识别的精度,在通过DSP对该音频数据进行模糊语音识别之前,还可以对该音频数据进行降噪和/或回音消除等过滤处理,即如图1d所示,在步骤“通过DSP对该音频数据进行模糊语音识别”之前,该语音识别方法还可以包括:
对该音频数据进行降噪和/或回音消除处理,得到处理后音频数据。
则此时,步骤“通过DSP对该音频数据进行模糊语音识别”具体可以为:通过DSP对该处理后音频数据进行模糊语音识别。
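作为降噪过滤的一个极简示意（假设性示例：仅用滑动平均压制噪声，真实 DSP 的降噪与回音消除算法远比此复杂）：

```python
def simple_denoise(samples, window=3):
    """滑动平均滤波示意：平滑样本序列，得到处理后音频数据。"""
    out = []
    for i in range(len(samples)):
        seg = samples[max(0, i - window + 1): i + 1]  # 取当前位置之前的窗口
        out.append(sum(seg) / len(seg))
    return out
```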
103、当模糊语音识别结果指示存在唤醒词时,由DSP唤醒处于休眠状态的CPU,即由DSP激活CPU的运行程序,比如,具体可以激活CPU中关于录音和音频数据的相关运行程序。
其中,唤醒词可以是一个,也可以是多个,该唤醒词具体可以根据实际应用的需求预先进行设置。比如,以该唤醒词包括“打电话”和“发信息”为例,则当模糊语音识别结果指示该音频数据中存在“打电话”或“发信息”这个词时,便可由DSP唤醒CPU,以此类推,等等。
在步骤“DSP唤醒处于休眠状态的CPU”后,该语音识别方法还可以包括:
通过CPU对该音频数据进行语义分析,并根据分析结果执行相应操作。
例如,具体可以根据分析结果确定操作对象和操作内容,然后,对该操作对象执行该操作内容,等等。
由于DSP的资源有限,语音识别精度不高,因此,为了进一步提高识别的精度,避免误唤醒的情况发生,可选的,在通过CPU对该音频数据进行语义分析之前,还可以由CPU对该音频数据作进一步识别,即在步骤“通过CPU对该音频数据进行语义分析”之前,该语音识别方法还可以包括:
从DSP中读取该音频数据中包含唤醒词的数据,得到唤醒数据,通过该CPU对该唤醒数据进行语音识别,当语音识别结果指示存在唤醒词时,执行通过CPU对该音频数据进行语义分析的步骤,否则,当语音识别结果指示不存在唤醒词时,将CPU设置为休眠,并返回执行获取音频数据的步骤(即步骤101)。
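上述 DSP 粗识别、CPU 二次确认、确认失败则 CPU 重新休眠并返回继续监听的整体控制流程，可以概括为如下草图（假设性示例，各回调函数代表文中对应步骤，非实际实现）：

```python
def recognition_loop(get_audio, dsp_fuzzy_check, cpu_verify, cpu_analyze):
    """DSP 监听 + CPU 二次确认的控制流程示意（假设性示例）。"""
    while True:
        audio = get_audio()              # 获取音频数据
        if not dsp_fuzzy_check(audio):   # DSP 模糊识别：无唤醒词则继续监听
            continue
        # 模糊识别结果指示存在唤醒词：DSP 唤醒 CPU，由 CPU 对唤醒数据再次识别
        if cpu_verify(audio):
            return cpu_analyze(audio)    # 确认存在唤醒词，进行语义分析
        # 否则 CPU 重新休眠，返回执行获取音频数据的步骤
```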
可选的,为了节省功耗,CPU在被唤醒时,可以不开启所有核心,而是采用单核和低频来进行运算处理,即步骤“通过该CPU对该唤醒数据进行语音识别”可以包括:
将该CPU的工作状态设置为单核且低频,使得CPU在该工作状态下对该唤醒数据进行语音识别。
其中,为了描述方便,在本发明实施例中,将这种“单核且低频”的工作状态称为第一状态,即CPU可以在该第一状态下,对该唤醒数据进行语音识别。
可选的,为了提高处理效率,当CPU确定存在唤醒词时,可以增加核数,并提升主频来对该音频数据进行语义分析,即步骤“通过CPU对该音频数据进行语义分析”可以包括:
将该CPU的工作状态设置为多核且高频,并在该工作状态下,由CPU对该音频数据进行语义分析。
其中,为了描述方便,在本发明实施例中,将该“多核且高频”的工作状态称为第二状态,即,可以将该CPU的工作状态设置为第二状态,在该第二状态下,对该音频数据进行语义分析。
需说明的是，在本发明实施例中，多核是指处理器中所集成的两个或多个完整的计算引擎（内核）；低频指的是主频低于预设频数，高频指的是主频大于等于预设频数，其中，该预设频数可以根据实际应用的需求而定，在此不再赘述。
可选的,为了提高处理的灵活性,使得功耗的消耗和处理效率可以得到更好地均衡,还可以根据具体的语音场景来调整CPU的工作核数和主频大小,即步骤“通过CPU对该音频数据进行语义分析”可以包括:
根据该音频数据对应的唤醒词确定语义场景,根据语义场景确定CPU的工作核数和主频大小,根据该工作核数和主频大小对CPU的工作状态进行设置,得到第三状态,在该第三状态下,对该音频数据进行语义分析。
比如,在“打电话”的语义场景下,可以采用较低的工作核数和主频大小来对该音频数据进行语义分析,而在“搜索”的语义场景下,可以采用较高的工作核数和主频大小来对该音频数据进行语义分析,等等。
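上述“根据语义场景确定工作核数和主频”的策略可以示意为一张查找表（假设性示例：场景条目与核数、主频数值均为假定，仅说明第三状态的确定方式）：

```python
# 语义场景 -> CPU 工作核数与主频大小的映射示意（数值为假定值）
SCENE_PROFILES = {
    "打电话": {"cores": 1, "freq_mhz": 600},
    "发信息": {"cores": 1, "freq_mhz": 800},
    "搜索":   {"cores": 2, "freq_mhz": 1500},
}
DEFAULT_PROFILE = {"cores": 2, "freq_mhz": 1000}  # 未知场景的兜底配置（假定）

def profile_for_wakeword(wakeword):
    """根据唤醒词对应的语义场景，确定第三状态的工作核数和主频大小。"""
    return SCENE_PROFILES.get(wakeword, DEFAULT_PROFILE)
```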
由上可知,本实施例在获取到音频数据后,可以通过DSP对该音频数据进行模糊语音识别,当确定存在唤醒词时,才由该DSP唤醒处于休眠状态的CPU,该CPU可以用于对该音频数据进行语义分析。由于该方案采用了运行功耗较低的DSP,代替运行功耗较高的CPU来对音频数据进行监听,因此,CPU无需一直处于被唤醒状态,而是可以处于休眠状态,并在需要时才被唤醒;所以,相对于现有方案只能通过外接电源或通过物理按键来唤醒的方案而言,该方案可以在保留移动性和语音唤醒功能的前提下,大大减少系统功耗,从而延长移动终端的待机时间,改善移动终端的性能。
实施例二、
根据实施例一所描述的方法,以下将举例作进一步详细说明。
在本实施例中,将以该语音识别装置具体集成在移动终端中为例进行说明。
如图2a所示,一种语音识别方法,具体流程可以如下:
201、移动终端通过MIC来采集该音频数据。
其中,该MIC可以独立于该移动终端,也可以内置在该移动终端中。而该音频数据则可以包括各种形式的声音所转换成的数据,该声音的类别可以不做限定,比如,可以是说话声、动物发出的声音、敲打物体的声音、和/或音乐,等等。
202、移动终端通过DSP对该音频数据进行模糊语音识别,若模糊语音识别结果指示存在唤醒词,则执行步骤203,否则,若模糊语音识别结果指示不存在唤醒词,则返回执行步骤201。
其中,唤醒词可以是一个,也可以是多个,该唤醒词具体可以根据实际应用的需求预先进行设置,比如,可以是“打电话”、“发信息”、“*是谁”、“谁是*”、“*是什么”、和/或“什么是*”,等等,其中,“*”可以是任意名词,比如“张三是谁”、“谁是李四”、或“Java是什么”,以此类推,等等。
其中，该DSP可以设置在该移动终端的编码解码器（即Codec）中，比如，如图1a所示。该编码解码器可以对音频数据进行压缩和解压缩（即编码和解码）；当MIC采集到音频数据后，会将该音频数据传送给编码解码器，以进行处理，如进行压缩和/或解压缩等处理，然后，传送给DSP进行模糊语音识别。其中，模糊语音识别的方式可以有多种，比如，可以采用模糊聚类分析来对该音频数据进行语音识别，或者，也可以采用模糊匹配算法来对该音频数据进行语音识别，等等，例如，具体可以如下：
(1)移动终端通过DSP,采用模糊聚类分析对该音频数据进行语音识别,得到模糊语音识别结果。
比如,DSP具体可以根据模糊聚类分析建立模糊聚类神经网络,然后,将该模糊聚类神经网络作为概率密度函数的估计器,对该音频数据包含唤醒词的概率进行预测,若预测结果指示概率大于等于设定值,则生成指示存在唤醒词的模糊语音识别结果,否则,若预测结果指示概率小于设定值,则生成指示不存在唤醒词的模糊语音识别结果。
其中,该设定值可以根据实际应用的需求进行设置,在此不再赘述。
(2)移动终端通过DSP,采用模糊匹配算法对该音频数据进行语音识别,得到模糊语音识别结果。
比如,DSP具体可以获取唤醒词读音的特征图,得到标准特征图,以及分析该音频数据中各个单词读音的特征图,得到待匹配特征图,然后,根据预设的隶属度函数计算各个待匹配特征图属于标准特征图的程度值,若该程度值大于等于预设值,则生成指示存在唤醒词的模糊语音识别结果,否则,若该程度值小于预设值,则生成指示不存在唤醒词的模糊语音识别结果。
其中,该隶属度函数和预设值可以根据实际应用的需求进行设置,此外,待匹配特征图属于标准特征图的程度也可通过隶属度来表示,隶属度越接近于1,表示该待匹配特征图属于标准特征图的程度越高,隶属度越接近于0,则表示该待匹配特征图属于标准特征图的程度越低,在此不再赘述。
可选的,为了提高语音识别的精度,在通过DSP对该音频数据进行模糊语音识别之前,还可以对该音频数据进行降噪和/或回音消除等过滤处理,即如图2b所示,步骤“移动终端通过DSP对该音频数据进行模糊语音识别”之前,该语音识别方法还可以包括:
移动终端对该音频数据进行降噪和/或回音消除处理,得到处理后音频数据。
则此时,步骤“移动终端通过DSP对该音频数据进行模糊语音识别”具体可以为:移动终端通过DSP对该处理后音频数据进行模糊语音识别。
203、当模糊语音识别结果指示存在唤醒词时,由DSP唤醒处于休眠状态的CPU。
例如,具体可以由DSP激活CPU的运行程序,比如,具体可以激活CPU中关于录音和音频数据的相关运行程序,等等。
比如,以该唤醒词包括“打电话”和“发信息”为例,则当模糊语音识别结果指示该音频数据中存在“打电话”或“发信息”这个词时,便可由DSP唤醒CPU,以此类推,等等。
204、移动终端通过DSP读取该音频数据中包含唤醒词的数据,得到唤醒数据。
例如,以唤醒词“打电话”为例,若DSP在对某段音频数据进行语音识别时,确定A段数据存在唤醒词“打电话”,则此时,移动终端可以读取A段数据,将该A段数据作为唤醒数据。
又例如,以唤醒词“发信息”为例,若DSP在对某段音频数据进行语音识别时,确定B段数据存在唤醒词“发信息”,则此时,移动终端可以读取B段数据,将该B段数据作为唤醒数据,以此类推,等等。
205、移动终端通过该CPU对该唤醒数据进行语音识别，当语音识别结果指示存在唤醒词时，执行步骤206，否则，当语音识别结果指示不存在唤醒词时，将CPU设置为休眠，并返回执行获取音频数据的步骤（即步骤201）。
比如,具体可以通知DSP执行对音频数据进行语音识别的操作,参见图2b。
可选的,为了节省功耗,CPU在被唤醒时,可以不开启所有核心,而是采用单核和低频来进行运算处理,即步骤“通过该CPU对该唤醒数据进行语音识别”可以包括:
将该CPU的工作状态设置为第一状态,即设置为单核且低频,使得CPU在该第一状态下对该唤醒数据进行语音识别。
步骤204和205为可选步骤。
206、移动终端通过CPU对该音频数据进行语义分析。
例如,具体可以将该CPU的工作状态设置为第二状态,即设置为多核且高频,并在该第二状态下,由CPU对该音频数据进行语义分析。
可选的,为了提高处理的灵活性,使得功耗的消耗和处理效率可以得到更好地均衡,还可以根据具体的语音场景来调整CPU的工作核数和主频大小;比如,移动终端可以根据该音频数据对应的唤醒词确定语义场景,然后,根据语义场景确定CPU的工作核数和主频大小,根据该工作核数和主频大小对CPU的工作状态进行设置(即第三状态),并在该工作状态下,对该音频数据进行语义分析。
比如，假设“打电话”对应的语义场景需要CPU的工作核数为单核，主频大小为X mhz；“发信息”对应的语义场景需要CPU的工作核数为单核，主频大小为Y mhz；“搜索”对应的语义场景需要CPU的工作核数为双核，主频大小为Z mhz；则具体可以如下：
若唤醒词为“打电话”,则可以将CPU的工作核数设置为单核,且主频大小设置为X mhz,然后,在该工作状态下,由CPU对该音频数据进行语义分析。
若唤醒词为“发信息”,则可以将CPU的工作核数设置为单核,且主频大小设置为Y mhz,然后,在该工作状态下,由CPU对该音频数据进行语义分析。
若唤醒词为“搜索”,则可以将CPU的工作核数设置为双核,且主频大小设置为Z mhz,然后,在该工作状态下,由CPU对该音频数据进行语义分析。
以此类推,等等。
需说明的是,CPU在被唤醒之后,如图2b所示,移动终端还可以通过MIC继续采集其他的音频数据,并由唤醒后的CPU进行语义分析,并根据分析结果执行相应操作,其中,语义分析的方式和“根据分析结果执行相应操作”的方式具体可参见步骤206和207,在此不再赘述。
207、移动终端根据分析结果执行相应操作。
比如,可以根据分析结果确定操作对象和操作内容,然后,通过CPU对该操作对象执行该操作内容,等等。
例如,以“打电话给张三”为例,移动终端可以确定操作对象为“通信录中的张三的电话号码”,操作内容为“拨打电话号码”,因此,此时可以通过CPU拨打通信录中的张三的电话号码,从而完成“打电话给张三”的任务。
又例如,以“搜索诗词”为例,移动终端可以确定操作对象为“搜索引擎应用”,操作内容为“通过搜索引擎应用搜索关键词‘诗词’”,因此,此时可以通过启动该移动终端中的搜索引擎应用,并通过搜索引擎应用搜索关键词‘诗词’,从而完成“搜索诗词”的任务,以此类推,等等。
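“根据分析结果确定操作对象和操作内容”这一步可以示意如下（假设性示例：analysis 的字段结构为假定，实际语义分析结果的表示方式由具体实现决定）：

```python
def plan_operation(analysis):
    """根据语义分析结果确定操作对象和操作内容（analysis 结构为假定）。"""
    if analysis["intent"] == "打电话":
        return {"object": f"通讯录中{analysis['contact']}的电话号码",
                "content": "拨打电话号码"}
    if analysis["intent"] == "搜索":
        return {"object": "搜索引擎应用",
                "content": f"通过搜索引擎应用搜索关键词“{analysis['keyword']}”"}
    return {"object": None, "content": None}  # 未识别的意图不执行操作
```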
由上可知,本实施例在获取到音频数据后,可以通过DSP对该音频数据进行模糊语音识别,当确定存在唤醒词时,才由该DSP唤醒处于休眠状态的CPU,由CPU采用单核且低频的工作状态再次对是否存在唤醒词进行确认,若CPU确定不存在唤醒词,则CPU切换至休眠状态,由DSP继续进行监听,只有在CPU确定存在唤醒词时,才由CPU对该音频数据进行语义分析,然后,根据分析结果执行相应操作;由于该方案采用了运行功耗较低的DSP,代替运行功耗较高的CPU来对音频数据进行监听,因此,CPU无需一直处于被唤醒状态,而是可以处于休眠状态,并在需要时才被唤醒;所以,相对于现有方案只能通过外接电源或通过物理按键来唤醒的方案而言,该方案可以在保留移动性和语音唤醒功能的前提下,大大减少系统功耗,从而延长移动终端的待机时间,改善移动终端的性能。
此外,由于该方案除了可以由DSP对唤醒词进行识别之外,还可以由CPU再次对唤醒词进行识别,因此,识别的精度较高,而且,由于CPU在对唤醒词进行识别时,采用的是较低功耗的工作状态(比如单核和低频),只有在确定存在唤醒词时,CPU才会采用较高功耗的工作状态来进行语义分析,因此,资源的利用更为合理有效,有利于进一步改善移动终端的性能。
实施例三、
为了更好地实施以上方法,本发明实施例还提供一种语音识别装置,该语音识别装置具体可以集成在移动终端,比如手机、穿戴式智能设备、平板电脑、和/或笔记本电脑等设备中。
例如,参见图3a,该语音识别装置可以包括获取单元301、模糊识别单元302、唤醒单元303,如下:
(1)获取单元301;
获取单元301,用于获取音频数据。
例如,获取单元301,具体可以用于通过MIC,比如移动终端内置的MIC模块来采集该音频数据。
(2)模糊识别单元302;
模糊识别单元302,用于通过DSP对该音频数据进行模糊语音识别。
其中,模糊语音识别的方式可以有多种,比如,可以采用模糊聚类分析来对该音频数据进行语音识别,或者,也可以采用模糊匹配算法来对该音频数据进行语音识别,等等;即:
第一种方式:
模糊识别单元302,具体可以用于通过DSP,采用模糊聚类分析对该音频数据进行语音识别,得到模糊语音识别结果。
比如,该模糊识别单元302,具体可以用于根据模糊聚类分析建立模糊聚类神经网络,将该模糊聚类神经网络作为概率密度函数的估计器,对该音频数据包含唤醒词的概率进行预测,若预测结果指示概率大于等于设定值,则生成指示存在唤醒词的模糊语音识别结果;若预测结果指示概率小于设定值,则生成指示不存在唤醒词的模糊语音识别结果。
其中,该设定值可以根据实际应用的需求进行设置,在此不再赘述。
第二种方式:
模糊识别单元302,具体可以用于通过DSP,采用模糊匹配算法对该音频数据进行语音识别,得到模糊语音识别结果。
比如,该模糊识别单元302,具体可以用于获取唤醒词读音的特征图,得到标准特征图,分析该音频数据中各个单词读音的特征图,得到待匹配特征图,根据预设的隶属度函数计算各个待匹配特征图属于标准特征图的程度值,若该程度值大于等于预设值,则生成指示存在唤醒词的模糊语音识别结果;若该程度值小于预设值,则生成指示不存在唤醒词的模糊语音识别结果。
其中,该隶属度函数和预设值可以根据实际应用的需求进行设置,在此不再赘述。
在一些实现方式中,所述语音识别装置还可以包括处理单元304,如图3b:
所述处理单元304,用于通过CPU对该音频数据进行语义分析,并根据分析结果执行相应操作。
例如,处理单元304,具体可以用于通过CPU对该音频数据进行语义分析,并根据分析结果确定操作对象和操作内容,然后,对该操作对象执行该操作内容,等等。
可选的,为了提高语音识别的精度,在模糊识别单元302对该音频数据进行模糊语音识别之前,还可以对该音频数据进行降噪和/或回音消除等过滤处理,即如图3c所示,该语音识别装置还可以包括过滤单元305,如下:
过滤单元305,可以用于对该音频数据进行降噪和/或回音消除处理。
则此时,模糊识别单元302,具体可以用于对过滤单元305处理后音频数据进行模糊语音识别。
(3)唤醒单元303;
唤醒单元303,可以用于当模糊语音识别结果指示存在唤醒词时,唤醒处于休眠状态的CPU。
其中,唤醒词可以是一个,也可以是多个,该唤醒词具体可以根据实际应用的需求预先进行设置,在此不再赘述。可选的,为了进一步提高识别的精度,避免误唤醒的情况发生,在处理单元304通过CPU对该音频数据进行语义分析之前,还可以对该音频数据作进一步识别,即如图3c所示,该语音识别装置还可以包括精确识别单元306,如下:
该精确识别单元306,可以用于从DSP中读取该音频数据中包含唤醒词的数据,得到唤醒数据;通过该CPU对该唤醒数据进行语音识别;当语音识别结果指示存在唤醒词时,触发处理单元304执行通过CPU对该音频数据进行语义分析的操作;当语音识别结果指示不存在唤醒词时,将CPU设置为休眠,并触发获取单元执行获取音频数据的操作。
可选的,为了节省功耗,CPU在被唤醒时,可以不开启所有核心,而是采用单核和低频来进行运算处理,即:
该精确识别单元306,具体可以用于将该CPU的工作状态设置为第一状态,在该第一状态下,对该唤醒数据进行语音识别,其中,该第一状态为单核且低频。
可选的,为了提高处理效率,当CPU确定存在唤醒词时,可以增加核数,并提升主频来对该音频数据进行语义分析,即:
该处理单元304,具体可以用于将该CPU的工作状态设置为第二状态,在该第二状态下,对该音频数据进行语义分析,其中,该第二状态为多核且高频。
可选的,为了提高处理的灵活性,使得功耗的消耗和处理效率可以得到更好地均衡,还可以根据具体的语音场景来调整CPU的工作核数和主频大小,即:
该处理单元304,具体可以用于根据该音频数据对应的唤醒词确定语义场景,根据语义场景确定CPU的工作核数和主频大小,根据该工作核数和主频大小对CPU的工作状态进行设置,得到第三状态,在该第三状态下,对该音频数据进行语义分析。
具体实施时,以上各个单元可以作为独立的实体来实现,也可以进行任意组合,作为同一或若干个实体来实现,以上各个单元的具体实施可参见前面的方法实施,在此不再赘述。
由上可知，本实施例的语音识别装置在通过获取单元301获取到音频数据后，可以由模糊识别单元302对该音频数据进行模糊语音识别，当确定存在唤醒词时，才由唤醒单元303唤醒处于休眠状态的CPU，该CPU可以用于对该音频数据进行语义分析。由于该方案采用了运行功耗较低的DSP，代替运行功耗较高的CPU来对音频数据进行监听，因此，CPU无需一直处于被唤醒状态，而是可以处于休眠状态，并在需要时才被唤醒；所以，相对于现有方案只能通过外接电源或通过物理按键来唤醒的方案而言，该方案可以在保留移动性和语音唤醒功能的前提下，大大减少系统功耗，从而延长移动终端的待机时间，改善移动终端的性能。
实施例四、
相应的,本发明实施例还提供一种移动终端,如图4所示,该移动终端可以包括射频(RF,Radio Frequency)电路401、包括有一个或一个以上计算机可读存储介质的存储器402、输入单元403、显示单元404、传感器405、音频电路406、无线保真(WiFi,Wireless Fidelity)模块407、包括有一个或者一个以上处理核心的处理器408、以及电源409等部件。本领域技术人员可以理解,图4中示出的移动终端结构并不构成对移动终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
RF电路401可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,交由一个或者一个以上处理器408处理;另外,将涉及上行的数据发送给基站。通常,RF电路401包括但不限于天线、至少一个放大器、调谐器、一个或多个振荡器、用户身份模块(SIM,Subscriber Identity Module)卡、收发信机、耦合器、低噪声放大器(LNA,Low Noise Amplifier)、双工器等。此外,RF电路401还可以通过无线通信与网络和其他设备通信。所述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(GSM,Global System of Mobile communication)、通用分组无线服务(GPRS,General Packet Radio Service)、码分多址(CDMA,Code Division Multiple Access)、宽带码分多址(WCDMA,Wideband Code Division Multiple Access)、长期演进(LTE,Long Term Evolution)、电子邮件、短消息服务(SMS,Short Messaging Service)等。
存储器402可用于存储软件程序以及模块,处理器408通过运行存储在存储器402的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器402可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据移动终端的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器402可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器402还可以包括存储器控制器,以提供处理器408和输入单元403对存储器402的访问。
输入单元403可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。具体地,在一个具体的实施例中,输入单元403可包括触敏表面以及其他输入设备。触敏表面,也称为触摸显示屏或者触控板,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触敏表面上或在触敏表面附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触敏表面可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器408,并能接收处理器408发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触敏表面。除了触敏表面,输入单元403还可以包括其他输入设备。具体地,其他输入设备可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元404可用于显示由用户输入的信息或提供给用户的信息以及移动终端的各种图形用户接口,这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。显示单元404可包括显示面板,可选的,可以采用液晶显示器(LCD,Liquid Crystal Display)、有机发光二极管(OLED,Organic Light-Emitting Diode)等形式来配置显示面板。进一步的,触敏表面可覆盖显示面板,当触敏表面检测到在其上或附近的触摸操作后,传送给处理器408以确定触摸事件的类型,随后处理器408根据触摸事件的类型在显示面板上提供相应的视觉输出。虽然在图4中,触敏表面与显示面板是作为两个独立的部件来实现输入和输入功能,但是在某些实施例中,可以将触敏表面与显示面板集成而实现输入和输出功能。
移动终端还可包括至少一种传感器405，比如光传感器、运动传感器以及其他传感器。具体地，光传感器可包括环境光传感器及接近传感器，其中，环境光传感器可根据环境光线的明暗来调节显示面板的亮度，接近传感器可在移动终端移动到耳边时，关闭显示面板和/或背光。作为运动传感器的一种，重力加速度传感器可检测各个方向上（一般为三轴）加速度的大小，静止时可检测出重力的大小及方向，可用于识别手机姿态的应用（比如横竖屏切换、相关游戏、磁力计姿态校准）、振动识别相关功能（比如计步器、敲击）等；至于移动终端还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器，在此不再赘述。
音频电路406、扬声器,传声器可提供用户与移动终端之间的音频接口。音频电路406可将接收到的音频数据转换后的电信号,传输到扬声器,由扬声器转换为声音信号输出;另一方面,传声器将收集的声音信号转换为电信号,由音频电路406接收后转换为音频数据,再将音频数据输出处理器408处理后,经RF电路401以发送给比如另一移动终端,或者将音频数据输出至存储器402以便进一步处理。音频电路406还可能包括耳塞插孔,以提供外设耳机与移动终端的通信。
WiFi属于短距离无线传输技术,移动终端通过WiFi模块407可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图4示出了WiFi模块407,但是可以理解的是,其并不属于移动终端的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
处理器408是移动终端的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器402内的软件程序和/或模块,以及调用存储在存储器402内的数据,执行移动终端的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器408可包括一个或多个处理核心;优选的,处理器408可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器408中。
移动终端还包括给各个部件供电的电源409(比如电池),优选的,电源可以通过电源管理系统与处理器408逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源409还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
尽管未示出,移动终端还可以包括摄像头、蓝牙模块等,在此不再赘述。具体在本实施例中,移动终端中的处理器408会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器402中,并由处理器408来运行存储在存储器402中的应用程序,从而实现各种功能:
获取音频数据,通过DSP对该音频数据进行模糊语音识别,当模糊语音识别结果指示存在唤醒词时,由DSP唤醒处于休眠状态的CPU,该CPU用于对该音频数据进行语义分析。
CPU被唤醒后,CPU可以对该音频数据进行语义分析,并根据分析结果执行相应操作。
例如,具体可以采用模糊聚类分析或模糊匹配算法来对该音频数据进行语音识别,等等,具体可参见前面的实施例,在此不再赘述。
可选的,为了提高语音识别的精度,在通过DSP对该音频数据进行模糊语音识别之前,还可以对该音频数据进行降噪和/或回音消除等过滤处理,即处理器408还可以运行存储在存储器402中的应用程序,从而实现以下功能:
对该音频数据进行降噪和/或回音消除处理,得到处理后音频数据。
可选的,为了进一步提高识别的精度,避免误唤醒的情况发生,在通过CPU对该音频数据进行语义分析之前,还可以由CPU对该音频数据作进一步识别,即处理器408还可以运行存储在存储器402中的应用程序,从而实现以下功能:
从DSP中读取该音频数据中包含唤醒词的数据,得到唤醒数据,通过该CPU对该唤醒数据进行语音识别,当语音识别结果指示存在唤醒词时,执行通过CPU对该音频数据进行语义分析的操作,否则,当语音识别结果指示不存在唤醒词时,将CPU设置为休眠,并返回执行获取音频数据的操作。
以上各个操作的具体实施可参见前面的实施例，在此不再赘述。
由上可知,本实施例的移动终端在获取到音频数据后,可以通过DSP对该音频数据进行模糊语音识别,当确定存在唤醒词时,才由该DSP唤醒处于休眠状态的CPU,该CPU可以用于对该音频数据进行语义分析。由于该方案采用了运行功耗较低的DSP,代替运行功耗较高的CPU来对音频数据进行监听,因此,CPU无需一直处于被唤醒状态,而是可以处于休眠状态,并在需要时才被唤醒;所以,相对于现有方案只能通过外接电源或通过物理按键来唤醒的方案而言,该方案可以在保留移动性和语音唤醒功能的前提下,大大减少系统功耗,从而延长移动终端的待机时间,改善移动终端的性能。
实施例五、
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本发明实施例提供一种存储介质,其中存储有多条指令,该指令能够被处理器进行加载,以执行本发明实施例所提供的任一种语音识别方法中的步骤。例如,该指令可以执行如下步骤:
获取音频数据,通过DSP对该音频数据进行模糊语音识别,当模糊语音识别结果指示存在唤醒词时,由DSP唤醒处于休眠状态的CPU,该CPU用于对该音频数据进行语义分析。
CPU被唤醒后,CPU可以对该音频数据进行语义分析,并根据分析结果执行相应操作。
例如,具体可以采用模糊聚类分析或模糊匹配算法来对该音频数据进行语音识别,等等,具体可参见前面的实施例,在此不再赘述。
可选的,为了提高语音识别的精度,在通过DSP对该音频数据进行模糊语音识别之前,还可以对该音频数据进行降噪和/或回音消除等过滤处理,即该指令还可以执行如下步骤:
对该音频数据进行降噪和/或回音消除处理,得到处理后音频数据。
可选的,为了进一步提高识别的精度,避免误唤醒的情况发生,在通过CPU对该音频数据进行语义分析之前,还可以由CPU对该音频数据作进一步识别,即该指令还可以执行如下步骤:
从DSP中读取该音频数据中包含唤醒词的数据,得到唤醒数据,通过该CPU对该唤醒数据进行语音识别,当语音识别结果指示存在唤醒词时,执行通过CPU对该音频数据进行语义分析的操作,否则,当语音识别结果指示不存在唤醒词时,将CPU设置为休眠,并返回执行获取音频数据的操作。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
其中,该存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。
由于该存储介质中所存储的指令,可以执行本发明实施例所提供的任一种语音识别方法中的步骤,因此,可以实现本发明实施例所提供的任一种语音识别方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
以上对本发明实施例所提供的一种语音识别方法、装置和存储介质进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims (17)

  1. 一种语音识别方法,包括:
    获取音频数据;
    通过数字信号处理器对所述音频数据进行模糊语音识别;
    当模糊语音识别结果指示存在唤醒词时,由所述数字信号处理器唤醒处于休眠状态的中央处理器,所述中央处理器用于对所述音频数据进行语义分析。
  2. 根据权利要求1所述的方法,所述通过数字信号处理器对所述音频数据进行模糊语音识别,包括:
    通过数字信号处理器,采用模糊聚类分析对所述音频数据进行语音识别,得到模糊语音识别结果。
  3. 根据权利要求2所述的方法,所述通过数字信号处理器,采用模糊聚类分析对所述音频数据进行语音识别,得到模糊语音识别结果,包括:
    根据模糊聚类分析建立模糊聚类神经网络;
    将所述模糊聚类神经网络作为概率密度函数的估计器,对所述音频数据包含唤醒词的概率进行预测;
    若预测结果指示概率大于等于设定值,则生成指示存在唤醒词的模糊语音识别结果;
    若预测结果指示概率小于所述设定值,则生成指示不存在唤醒词的模糊语音识别结果。
  4. 根据权利要求1所述的方法,所述通过数字信号处理器对所述音频数据进行模糊语音识别,包括:
    通过数字信号处理器,采用模糊匹配算法对所述音频数据进行语音识别,得到模糊语音识别结果。
  5. 根据权利要求4所述的方法,所述通过数字信号处理器,采用模糊匹配算法对所述音频数据进行语音识别,得到模糊语音识别结果,包括:
    获取唤醒词读音的特征图,得到标准特征图;
    分析所述音频数据中各个单词读音的特征图,得到待匹配特征图;
    根据预设的隶属度函数计算各个待匹配特征图属于标准特征图的程度值;
    若所述程度值大于等于预设值,则生成指示存在唤醒词的模糊语音识别结果;
    若所述程度值小于所述预设值,则生成指示不存在唤醒词的模糊语音识别结果。
  6. 根据权利要求1所述的方法,在所述由所述数字信号处理器唤醒处于休眠状态的中央处理器之后,还包括:
    通过所述中央处理器对所述音频数据进行语义分析,并根据分析结果执行所述分析结果相应的操作。
  7. 根据权利要求6所述的方法,所述通过所述中央处理器对所述音频数据进行语义分析之前,还包括:
    从所述数字信号处理器中读取所述音频数据中包含唤醒词的数据,得到唤醒数据;
    通过所述中央处理器对所述唤醒数据进行语音识别;
    当语音识别结果指示存在唤醒词时,执行通过所述中央处理器对所述音频数据进行语义分析的步骤;
    当语音识别结果指示不存在唤醒词时,将所述中央处理器设置为休眠,并返回执行获取音频数据的步骤。
  8. 根据权利要求7所述的方法,所述通过所述中央处理器对所述唤醒数据进行语音识别,包括:
    将所述中央处理器的工作状态设置为第一状态,所述第一状态为单核且低频;
    在所述第一状态下,对所述唤醒数据进行语音识别。
  9. 根据权利要求6至8任一项所述的方法,所述通过所述中央处理器对所述音频数据进行语义分析,包括:
    将所述中央处理器的工作状态设置为第二状态,所述第二状态为多核且高频;
    在所述第二状态下,对所述音频数据进行语义分析。
  10. 根据权利要求6至8任一项所述的方法,所述通过所述中央处理器对所述音频数据进行语义分析,包括:
    根据所述音频数据对应的唤醒词确定语义场景;
    根据语义场景确定所述中央处理器的工作核数和主频大小;
    根据所述工作核数和主频大小对所述中央处理器的工作状态进行设置,得到第三状态;
    在所述第三状态下,对所述音频数据进行语义分析。
  11. 根据权利要求1至8任一项所述的方法,所述通过数字信号处理器对所述音频数据进行模糊语音识别之前,还包括:
    对所述音频数据进行降噪和/或回音消除处理。
  12. 根据权利要求6至8任一项所述的方法,所述根据分析结果执行相应操作,包括:
    根据所述分析结果确定操作对象和操作内容;
    对所述操作对象执行所述操作内容。
  13. 一种语音识别装置,包括:
    获取单元,用于获取音频数据;
    模糊识别单元,用于通过数字信号处理器对所述音频数据进行模糊语音识别;
    唤醒单元,用于当模糊语音识别结果指示存在唤醒词时,唤醒处于休眠状态的中央处理器,所述中央处理器用于对所述音频数据进行语义分析。
  14. 根据权利要求13所述的装置,还包括处理单元:
    所述处理单元,用于通过所述中央处理器对所述音频数据进行语义分析,并根据分析结果执行相应操作。
  15. 根据权利要求13所述的装置,其特征在于,还包括精确识别单元;
    所述精确识别单元,用于从所述数字信号处理器中读取所述音频数据中包含唤醒词的数据,得到唤醒数据;通过所述中央处理器对所述唤醒数据进行语音识别;当语音识别结果指示存在唤醒词时,触发所述处理单元执行通过所述中央处理器对所述音频数据进行语义分析的操作;当语音识别结果指示不存在唤醒词时,将所述中央处理器设置为休眠,并触发所述获取单元执行获取音频数据的操作。
  16. 根据权利要求13-15任一项所述的装置,其特征在于,
    所述处理单元,具体用于根据所述音频数据对应的唤醒词确定语义场景,根据语义场景确定所述中央处理器的工作核数和主频大小,根据所述工作核数和主频大小对所述中央处理器的工作状态进行设置,得到第三状态,在所述第三状态下,对所述音频数据进行语义分析。
  17. 一种存储介质,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行权利要求1至12任一项所述的语音识别方法中的步骤。
PCT/CN2018/091926 2017-07-19 2018-06-20 语音识别方法、装置和存储介质 WO2019015435A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020502569A JP6949195B2 (ja) 2017-07-19 2018-06-20 音声認識方法及び装置、並びに記憶媒体
KR1020207004025A KR102354275B1 (ko) 2017-07-19 2018-06-20 음성 인식 방법 및 장치, 그리고 저장 매체
US16/743,150 US11244672B2 (en) 2017-07-19 2020-01-15 Speech recognition method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710588382.8 2017-07-19
CN201710588382.8A CN107360327B (zh) 2017-07-19 2017-07-19 语音识别方法、装置和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/743,150 Continuation US11244672B2 (en) 2017-07-19 2020-01-15 Speech recognition method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2019015435A1 true WO2019015435A1 (zh) 2019-01-24

Family

ID=60285244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/091926 WO2019015435A1 (zh) 2017-07-19 2018-06-20 语音识别方法、装置和存储介质

Country Status (5)

Country Link
US (1) US11244672B2 (zh)
JP (1) JP6949195B2 (zh)
KR (1) KR102354275B1 (zh)
CN (1) CN107360327B (zh)
WO (1) WO2019015435A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175016A (zh) * 2019-05-29 2019-08-27 英业达科技有限公司 启动语音助理的方法及具有语音助理的电子装置
EP3846162A1 (en) * 2020-01-03 2021-07-07 Baidu Online Network Technology (Beijing) Co., Ltd. Smart audio device, calling method for audio device, electronic device and computer readable medium
CN113223510A (zh) * 2020-01-21 2021-08-06 青岛海尔电冰箱有限公司 冰箱及其设备语音交互方法、计算机可读存储介质
EP3851952A3 (en) * 2020-03-12 2021-08-25 Beijing Baidu Netcom Science And Technology Co. Ltd. Signal processing method, signal processing device, and electronic device
CN117672200A (zh) * 2024-02-02 2024-03-08 天津市爱德科技发展有限公司 一种物联网设备的控制方法、设备及系统

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107360327B (zh) * 2017-07-19 2021-05-07 腾讯科技(深圳)有限公司 语音识别方法、装置和存储介质
CN108337362A (zh) 2017-12-26 2018-07-27 百度在线网络技术(北京)有限公司 语音交互方法、装置、设备和存储介质
CN110164426B (zh) * 2018-02-10 2021-10-26 佛山市顺德区美的电热电器制造有限公司 语音控制方法和计算机存储介质
CN108831477B (zh) * 2018-06-14 2021-07-09 出门问问信息科技有限公司 一种语音识别方法、装置、设备及存储介质
CN109003604A (zh) * 2018-06-20 2018-12-14 恒玄科技(上海)有限公司 一种实现低功耗待机的语音识别方法及系统
CN108986822A (zh) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 语音识别方法、装置、电子设备及非暂态计算机存储介质
CN109686370A (zh) * 2018-12-24 2019-04-26 苏州思必驰信息科技有限公司 基于语音控制进行斗地主游戏的方法及装置
CN111383632B (zh) * 2018-12-28 2023-10-31 北京小米移动软件有限公司 电子设备
CN109886386B (zh) * 2019-01-30 2020-10-27 北京声智科技有限公司 唤醒模型的确定方法及装置
CN109922397B (zh) * 2019-03-20 2020-06-16 深圳趣唱科技有限公司 音频智能处理方法、存储介质、智能终端及智能蓝牙耳机
CN109979438A (zh) * 2019-04-04 2019-07-05 Oppo广东移动通信有限公司 语音唤醒方法及电子设备
CN112015258B (zh) * 2019-05-31 2022-07-15 瑞昱半导体股份有限公司 处理系统与控制方法
CN110265029A (zh) * 2019-06-21 2019-09-20 百度在线网络技术(北京)有限公司 语音芯片和电子设备
CN112207811B (zh) * 2019-07-11 2022-05-17 杭州海康威视数字技术股份有限公司 一种机器人控制方法、装置、机器人及存储介质
WO2021016931A1 (zh) * 2019-07-31 2021-02-04 华为技术有限公司 一种集成芯片以及处理传感器数据的方法
CN110968353A (zh) * 2019-12-06 2020-04-07 惠州Tcl移动通信有限公司 中央处理器的唤醒方法、装置、语音处理器以及用户设备
CN111071879A (zh) * 2020-01-01 2020-04-28 门鑫 电梯楼层登记方法、装置及存储介质
CN113628616A (zh) * 2020-05-06 2021-11-09 阿里巴巴集团控股有限公司 音频采集设备、无线耳机以及电子设备系统
CN111679861A (zh) * 2020-05-09 2020-09-18 浙江大华技术股份有限公司 电子设备的唤醒装置、方法和计算机设备和存储介质
CN113760218A (zh) * 2020-06-01 2021-12-07 阿里巴巴集团控股有限公司 数据处理方法、装置、电子设备及计算机存储介质
CN111696553B (zh) * 2020-06-05 2023-08-22 北京搜狗科技发展有限公司 一种语音处理方法、装置及可读介质
US11877237B2 (en) * 2020-06-15 2024-01-16 TriSpace Technologies (OPC) Pvt. Ltd. System and method for optimizing power consumption in multimedia signal processing in mobile devices
CN111755002B (zh) * 2020-06-19 2021-08-10 北京百度网讯科技有限公司 语音识别装置、电子设备和语音识别方法
CN111833870A (zh) * 2020-07-01 2020-10-27 中国第一汽车股份有限公司 车载语音系统的唤醒方法、装置、车辆和介质
CN112133302B (zh) * 2020-08-26 2024-05-07 北京小米松果电子有限公司 预唤醒终端的方法、装置及存储介质
CN111986671B (zh) * 2020-08-28 2024-04-05 京东科技信息技术有限公司 服务机器人及其语音开关机方法和装置
CN112216283B (zh) * 2020-09-24 2024-02-23 建信金融科技有限责任公司 一种语音识别方法、装置、设备及存储介质
CN112698872A (zh) * 2020-12-21 2021-04-23 北京百度网讯科技有限公司 语音数据处理的方法、装置、设备及存储介质
CN216145422U (zh) * 2021-01-13 2022-03-29 神盾股份有限公司 语音助理系统
CN113053360A (zh) * 2021-03-09 2021-06-29 南京师范大学 一种精准度高的基于语音软件识别方法
CN113297363A (zh) * 2021-05-28 2021-08-24 安徽领云物联科技有限公司 智能语义交互机器人系统
CN113393838A (zh) * 2021-06-30 2021-09-14 北京探境科技有限公司 语音处理方法、装置、计算机可读存储介质及计算机设备
CN117253488A (zh) * 2022-06-10 2023-12-19 Oppo广东移动通信有限公司 语音识别方法、装置、设备及存储介质
CN118506774A (zh) * 2023-02-15 2024-08-16 Oppo广东移动通信有限公司 语音唤醒方法、装置、电子设备、存储介质及产品
CN116822529B (zh) * 2023-08-29 2023-12-29 国网信息通信产业集团有限公司 基于语义泛化的知识要素抽取方法
CN117524228A (zh) * 2024-01-08 2024-02-06 腾讯科技(深圳)有限公司 语音数据处理方法、装置、设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866274A (zh) * 2014-12-01 2015-08-26 联想(北京)有限公司 信息处理方法及电子设备
CN105723451A (zh) * 2013-12-20 2016-06-29 英特尔公司 从低功率始终侦听模式到高功率语音识别模式的转换
CN106356059A (zh) * 2015-07-17 2017-01-25 中兴通讯股份有限公司 语音控制方法、装置及投影仪设备
CN107360327A (zh) * 2017-07-19 2017-11-17 腾讯科技(深圳)有限公司 语音识别方法、装置和存储介质

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2906605B2 (ja) * 1990-07-12 1999-06-21 松下電器産業株式会社 パターン認識装置
JPH06149286A (ja) * 1992-11-10 1994-05-27 Clarion Co Ltd 不特定話者音声認識装置
JP2004045900A (ja) * 2002-07-12 2004-02-12 Toyota Central Res & Dev Lab Inc 音声対話装置及びプログラム
US9117449B2 (en) * 2012-04-26 2015-08-25 Nuance Communications, Inc. Embedded system for construction of small footprint speech recognition with user-definable constraints
CN102866921B (zh) 2012-08-29 2016-05-11 惠州Tcl移动通信有限公司 一种多核cpu的调控方法及系统
US10304465B2 (en) * 2012-10-30 2019-05-28 Google Technology Holdings LLC Voice control user interface for low power mode
US9704486B2 (en) 2012-12-11 2017-07-11 Amazon Technologies, Inc. Speech recognition power management
CN105575395A (zh) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 语音唤醒方法及装置、终端及其处理方法
KR102299330B1 (ko) * 2014-11-26 2021-09-08 삼성전자주식회사 음성 인식 방법 및 그 전자 장치
JP6501217B2 (ja) 2015-02-16 2019-04-17 アルパイン株式会社 情報端末システム
GB2535766B (en) * 2015-02-27 2019-06-12 Imagination Tech Ltd Low power detection of an activation phrase
CN105976808B (zh) * 2016-04-18 2023-07-25 成都启英泰伦科技有限公司 一种智能语音识别系统及方法
CN106020987A (zh) * 2016-05-31 2016-10-12 广东欧珀移动通信有限公司 处理器中内核运行配置的确定方法以及装置
US20180293974A1 (en) * 2017-04-10 2018-10-11 Intel IP Corporation Spoken language understanding based on buffered keyword spotting and speech recognition
US10311870B2 (en) * 2017-05-10 2019-06-04 Ecobee Inc. Computerized device with voice command input capability

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105723451A (zh) * 2013-12-20 2016-06-29 英特尔公司 从低功率始终侦听模式到高功率语音识别模式的转换
CN104866274A (zh) * 2014-12-01 2015-08-26 联想(北京)有限公司 信息处理方法及电子设备
CN106356059A (zh) * 2015-07-17 2017-01-25 中兴通讯股份有限公司 语音控制方法、装置及投影仪设备
CN107360327A (zh) * 2017-07-19 2017-11-17 腾讯科技(深圳)有限公司 语音识别方法、装置和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, YUHONG ET AL.: "Speech Recognition Based on Fuzzy Clustering Neural Network", CHINESE JOURNAL OF COMPUTERS, vol. 29, no. 10, 30 October 2006 (2006-10-30), pages 1894 - 1900, ISSN: 0254-4164 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175016A (zh) * 2019-05-29 2019-08-27 英业达科技有限公司 启动语音助理的方法及具有语音助理的电子装置
EP3846162A1 (en) * 2020-01-03 2021-07-07 Baidu Online Network Technology (Beijing) Co., Ltd. Smart audio device, calling method for audio device, electronic device and computer readable medium
JP2021110945A (ja) * 2020-01-03 2021-08-02 バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド スマートオーディオ装置、方法、電子デバイスおよびコンピュータ可読媒体
CN113223510A (zh) * 2020-01-21 2021-08-06 青岛海尔电冰箱有限公司 冰箱及其设备语音交互方法、计算机可读存储介质
CN113223510B (zh) * 2020-01-21 2022-09-20 青岛海尔电冰箱有限公司 冰箱及其设备语音交互方法、计算机可读存储介质
EP3851952A3 (en) * 2020-03-12 2021-08-25 Beijing Baidu Netcom Science And Technology Co. Ltd. Signal processing method, signal processing device, and electronic device
CN117672200A (zh) * 2024-02-02 2024-03-08 天津市爱德科技发展有限公司 一种物联网设备的控制方法、设备及系统
CN117672200B (zh) * 2024-02-02 2024-04-16 天津市爱德科技发展有限公司 一种物联网设备的控制方法、设备及系统

Also Published As

Publication number Publication date
US11244672B2 (en) 2022-02-08
CN107360327B (zh) 2021-05-07
JP6949195B2 (ja) 2021-10-13
KR102354275B1 (ko) 2022-01-21
KR20200027554A (ko) 2020-03-12
US20200152177A1 (en) 2020-05-14
CN107360327A (zh) 2017-11-17
JP2020527754A (ja) 2020-09-10

Similar Documents

Publication Publication Date Title
US11244672B2 (en) Speech recognition method and apparatus, and storage medium
WO2017206916A1 (zh) 处理器中内核运行配置的确定方法以及相关产品
WO2018032581A1 (zh) 一种应用程序控制方法及装置
WO2017063604A1 (zh) 消息推送方法及移动终端和消息推送服务器
WO2017206915A1 (zh) 处理器中内核运行配置的确定方法以及相关产品
CN105630846B (zh) 头像更新方法及装置
WO2015081664A1 (zh) 控制无线网络开关方法、装置、设备及系统
CN103699409A (zh) 一种电子设备切入唤醒状态的方法、装置和系统
CN104375886A (zh) 信息处理方法、装置和电子设备
CN109389977B (zh) 一种语音交互方法及装置
WO2017206860A1 (zh) 移动终端的处理方法及移动终端
WO2017206918A1 (zh) 终端加速唤醒方法以及相关产品
CN111443803A (zh) 模式切换方法、装置、存储介质及移动终端
CN110543333B (zh) 针对处理器的休眠处理方法、装置、移动终端和存储介质
CN115985323B (zh) 语音唤醒方法、装置、电子设备及可读存储介质
CN111027406B (zh) 图片识别方法、装置、存储介质及电子设备
WO2018214745A1 (zh) 应用控制方法及相关产品
CN110277097B (zh) 数据处理方法及相关设备
CN109062643A (zh) 一种显示界面调整方法、装置及终端
CN116486833B (zh) 音频增益调整方法、装置、存储介质及电子设备
CN113254088A (zh) 功能程序唤醒方法、终端及存储介质
CN111897916A (zh) 语音指令识别方法、装置、终端设备及存储介质
CN111580911A (zh) 一种终端的操作提示方法、装置、存储介质及终端
WO2015067206A1 (zh) 一种文件查找的方法及终端
CN112433694B (zh) 光强度调整方法及装置、存储介质和动终端

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18835410

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020502569

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20207004025

Country of ref document: KR

Kind code of ref document: A

122 Ep: pct application non-entry in european phase

Ref document number: 18835410

Country of ref document: EP

Kind code of ref document: A1