WO2021169711A1 - Instruction execution method and apparatus, storage medium, and electronic device - Google Patents

Instruction execution method and apparatus, storage medium, and electronic device

Info

Publication number
WO2021169711A1
WO2021169711A1 (PCT/CN2021/073831)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
processor
voice
instruction
model
Prior art date
Application number
PCT/CN2021/073831
Other languages
English (en)
French (fr)
Inventor
陈喆
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Priority to EP21759843.2A priority Critical patent/EP4095850A1/en
Publication of WO2021169711A1 publication Critical patent/WO2021169711A1/zh

Links

Images

Classifications

    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback (under G06F 3/16: Sound input; sound output)
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/063: Training (under G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/08: Speech classification or search
    • G10L 15/32: Multiple recognisers used in sequence or in parallel; score combination systems therefor, e.g. voting systems
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • G10L 2015/088: Word spotting
    • G10L 2015/223: Execution procedure of a spoken command
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of speech processing, and in particular to an instruction execution method, device, storage medium, and electronic equipment.
  • At present, users can speak a wake-up word to wake up the electronic device, and speak a voice instruction to control the electronic device to perform a specific operation, without directly operating the electronic device.
  • the embodiments of the present application provide an instruction execution method, device, storage medium, and electronic equipment, which can improve the ease of use of voice control and reduce the power consumption of the electronic equipment for voice wake-up.
  • In a first aspect, an embodiment of the application provides an instruction execution method applied to an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor.
  • the instruction execution method includes: when the processor is in a sleep state, controlling the microphone through the dedicated voice recognition chip to perform audio collection to obtain first audio data;
  • verifying the first audio data through the dedicated voice recognition chip, and if the verification passes, waking up the processor;
  • verifying the first audio data through the processor, and if the verification passes, controlling the microphone through the processor to perform audio collection to obtain second audio data; and
  • invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, and executing the voice instruction.
  • In a second aspect, an embodiment of the present application provides an instruction execution device applied to an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor.
  • the instruction execution device includes:
  • An audio collection module configured to control the microphone to perform audio collection through the dedicated voice recognition chip when the processor is in a dormant state to obtain first audio data
  • the first verification module is configured to verify the first audio data through the dedicated voice recognition chip, and if the verification passes, wake up the processor;
  • the second verification module is configured to verify the first audio data through the processor, and if the verification passes, control the microphone through the processor to perform audio collection to obtain second audio data;
  • the instruction execution module is configured to use the processor to call a pre-trained instruction recognition model to recognize the voice instruction carried by the second audio data, and execute the voice instruction.
  • In a third aspect, the embodiments of the present application provide a storage medium on which a computer program is stored.
  • When the computer program runs on an electronic device including a processor, a dedicated voice recognition chip, and a microphone, the electronic device is caused to execute the steps in the instruction execution method provided by the embodiments of the present application,
  • where the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor.
  • In a fourth aspect, an embodiment of the present application further provides an electronic device. The electronic device includes an audio collection unit, a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor, where:
  • the dedicated voice recognition chip is configured to, when the processor is in a sleep state, control the microphone to collect external first audio data; and to verify the first audio data, and if the verification passes, wake up the processor; and
  • the processor is configured to verify the first audio data, and if the verification passes, control the microphone to collect external second audio data; and
  • the pre-trained instruction recognition model is invoked to recognize the voice instruction carried in the second audio data, and execute the voice instruction.
  • FIG. 1 is a schematic flowchart of an instruction execution method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of invoking a first-level text verification model in an embodiment of the present application.
  • FIG. 3 is another schematic flowchart of the instruction execution method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an instruction execution device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the embodiment of the application first provides an instruction execution method.
  • the execution subject of the instruction execution method may be the electronic device provided in the embodiment of the application.
  • the electronic device includes a processor, a dedicated voice recognition chip, and a microphone.
  • The power consumption of the dedicated voice recognition chip is less than the power consumption of the processor, and the electronic device may be a device equipped with a processor and having processing capability, such as a smartphone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
  • The present application provides an instruction execution method applied to an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor. The instruction execution method includes:
  • the processor verifies the first audio data, and if the verification passes, the processor controls the microphone to perform audio collection to obtain second audio data;
  • the pre-trained instruction recognition model is invoked by the processor to recognize the voice instruction carried in the second audio data, and execute the voice instruction.
  • In an embodiment, the instruction recognition model includes a plurality of instruction recognition models corresponding to different voice instructions, and the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data includes:
  • scoring the second audio data by invoking each instruction recognition model through the processor; and
  • using the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
  • In an embodiment, the using the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data includes:
  • using the voice instruction corresponding to the instruction recognition model that has the highest score and reaches a preset score as the voice instruction carried in the second audio data.
  • In an embodiment, the verifying the first audio data through the dedicated voice recognition chip includes:
  • invoking a pre-trained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data to obtain a scene classification result; and
  • invoking, through the dedicated voice recognition chip, a pre-trained first-level text verification model corresponding to the scene classification result to verify whether the first audio data includes a preset wake-up word.
  • In an embodiment, the verifying the first audio data through the processor includes:
  • invoking, through the processor, a pre-trained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data includes the preset wake-up word;
  • when the first audio data includes the preset wake-up word, invoking a pre-trained second-level voiceprint verification model through the processor, where the second-level voiceprint verification model is obtained by training on sample speech of a preset user speaking the preset wake-up word; and
  • verifying, through the second-level voiceprint verification model, whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
  • In an embodiment, the method further includes:
  • obtaining, through the processor, a pre-trained general verification model corresponding to the preset wake-up word, and setting the general verification model as the second-level text verification model;
  • controlling the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word; and
  • performing, through the processor, adaptive training on the general verification model by using the sample speech to obtain the second-level voiceprint verification model.
  • In an embodiment, after the waking up the processor, the method further includes: controlling the dedicated voice recognition chip to sleep.
  • In an embodiment, after the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, the method further includes:
  • when no voice instruction carried in the second audio data is recognized, starting a voice interaction application in the background through the processor; and
  • recognizing the voice instruction carried in the second audio data through the voice interaction application, where the recognition capability of the voice interaction application is greater than the recognition capability of the instruction recognition model.
  • In an embodiment, the method further includes: switching the screen of the electronic device to a bright-screen state through the processor.
  • FIG. 1 is a schematic flowchart of an instruction execution method provided by an embodiment of the application.
  • the instruction execution method is applied to the electronic device provided in this application.
  • The electronic device includes a processor, a dedicated voice recognition chip, and a microphone. As shown in FIG. 1, the flow of the instruction execution method provided by this embodiment of the application may be as follows:
  • In 101, when the processor is in a sleep state, the microphone is controlled through the dedicated voice recognition chip to perform audio collection to obtain first audio data.
  • It should be noted that the dedicated voice recognition chip in the embodiments of the application is a dedicated chip designed for voice recognition, such as a digital signal processing chip designed for voice or an application-specific integrated circuit chip designed for voice, which has lower power consumption than a general-purpose processor.
  • the dedicated voice recognition chip and the processor establish a communication connection through a communication bus (such as an I2C bus) to realize data interaction.
  • the processor sleeps when the screen of the electronic device is in the off-screen state
  • the dedicated voice recognition chip sleeps when the screen is in the on-screen state.
  • the microphone included in the electronic device may be a built-in microphone or an external microphone (which may be a wired microphone or a wireless microphone).
  • In the embodiments of the application, when the processor is in the sleep state (and the dedicated voice recognition chip is in the awake state), the dedicated voice recognition chip controls the microphone to collect external sound, and the collected audio data is recorded as the first audio data.
  • In 102, the first audio data is verified through the dedicated voice recognition chip, and if the verification passes, the processor is woken up.
  • the verification of the first audio data by the dedicated voice recognition chip includes, but is not limited to, verification of the text feature and/or voiceprint feature of the first audio data.
  • In plain terms, verifying the text feature of the first audio data means verifying whether the first audio data includes a preset wake-up word; as long as the first audio data includes the preset wake-up word, the text-feature verification of the first audio data passes, regardless of who speaks the preset wake-up word.
  • For example, if the first audio data includes a preset wake-up word set by a preset user (for example, the owner of the electronic device, or another user authorized by the owner to use the electronic device), but the preset wake-up word is spoken by user A rather than the preset user, the dedicated voice recognition chip will still pass the verification when verifying the text feature of the first audio data through the first verification algorithm.
  • Verifying the text feature and the voiceprint feature of the first audio data means verifying whether the first audio data includes the preset wake-up word spoken by the preset user. If the first audio data includes the preset wake-up word spoken by the preset user, the text-feature and voiceprint-feature verification of the first audio data passes; otherwise, the verification fails.
  • For example, if the first audio data includes the preset wake-up word set by the preset user and the preset wake-up word is spoken by the preset user, the text-feature and voiceprint-feature verification of the first audio data passes.
  • If the first audio data includes the preset wake-up word spoken by a user other than the preset user, or does not include the preset wake-up word spoken by any user, the text-feature and voiceprint-feature verification of the first audio data fails.
  • In the embodiments of the application, when the verification of the first audio data passes, the dedicated voice recognition chip sends a preset interrupt signal to the processor through the communication connection between them, so as to wake up the processor.
  • It should be noted that if the first audio data fails the verification, the dedicated voice recognition chip continues to control the microphone to perform audio collection until the first audio data passes the verification.
  • In 103, the first audio data is verified through the processor, and if the verification passes, the microphone is controlled through the processor to perform audio collection to obtain second audio data.
  • After waking up the processor, the dedicated voice recognition chip also provides the first audio data to the processor, and the processor verifies the first audio data again. Taking the Android system as an example, the dedicated voice recognition chip can provide the first audio data to the processor through the SoundTrigger framework.
  • the verification of the first audio data by the processor includes, but is not limited to, verification of the text feature and/or voiceprint feature of the aforementioned first audio data.
  • When the verification of the first audio data by the processor passes, the processor controls the microphone to perform audio collection, and records the collected audio data as the second audio data.
  • In addition, if the verification of the first audio data by the processor passes, the processor also switches the screen to a bright-screen state.
  • the pre-trained instruction recognition model is called by the processor to recognize the voice instruction carried in the second audio data, and execute the aforementioned voice instruction.
  • It should be noted that in the embodiments of the application, a machine learning algorithm is also used to train an instruction recognition model in advance, and the instruction recognition model is configured to recognize the voice instruction carried in input audio data.
  • Correspondingly, after collecting the second audio data, the processor invokes the pre-trained instruction recognition model, inputs the second audio data into the instruction recognition model for recognition to obtain the voice instruction carried in the second audio data, and executes the voice instruction.
  • For example, when the voice instruction carried in the second audio data is recognized as "start the voice interaction application", the processor starts the voice interaction application to perform more complex voice interaction with the user through the voice interaction application.
  • For another example, when the voice instruction carried in the second audio data is recognized as "play music", the processor starts a default music player for the user to play the desired music.
  • As can be seen from the above, when the processor is in the sleep state, the dedicated voice recognition chip, whose power consumption is lower than that of the processor, controls the microphone to perform audio collection to obtain first audio data; the first audio data is then verified through the dedicated voice recognition chip, and if the verification passes, the processor is woken up; the first audio data is verified again through the processor, and if the verification passes, the microphone is controlled through the processor to perform audio collection to obtain second audio data; finally, a pre-trained instruction recognition model is invoked through the processor to recognize the voice instruction carried in the second audio data, and the voice instruction is executed.
  • In this way, the power consumption of the electronic device for voice wake-up can be reduced; at the same time, since there is no need to start a voice interaction application to recognize voice instructions, the ease of use of voice control is also improved.
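  • For illustration only, the overall flow of 101 to 104 can be sketched as a simple loop. This is a minimal, hypothetical sketch: the names (mic.record, dsp_verify, cpu_verify, recognize_instruction, execute) are placeholders invented here, not APIs from this application or from any real SDK.

```python
# Minimal sketch of the two-stage wake-up pipeline (all helper names are hypothetical).

def wake_pipeline(mic, dsp_verify, cpu_verify, recognize_instruction, execute):
    """Run one wake-up cycle: the low-power chip verifies first, the processor second."""
    while True:
        first_audio = mic.record()       # 101: chip-driven capture while the processor sleeps
        if dsp_verify(first_audio):      # 102: first-level verification on the low-power chip
            break                        # corresponds to the interrupt that wakes the processor
    if not cpu_verify(first_audio):      # 103: the processor re-verifies the same audio
        return None                      # verification failed; nothing is executed
    second_audio = mic.record()          # processor-controlled capture of the command
    instruction = recognize_instruction(second_audio)  # 104: instruction recognition model
    if instruction is not None:
        execute(instruction)             # 104: execute the recognized voice instruction
    return instruction
```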
  • In an embodiment, the instruction recognition model includes a plurality of instruction recognition models corresponding to different voice instructions, and the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data includes:
  • (1) invoking each instruction recognition model through the processor to score the second audio data; and
  • (2) using the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
  • It should be noted that in the embodiments of the application, a plurality of instruction recognition models corresponding to different voice instructions are trained in advance. Exemplarily, voice instructions include "play music", "open WeChat", "start the voice interaction application", and so on.
  • For each voice instruction, sample speech including the voice instruction is collected, and its spectrogram is extracted.
  • Then, a convolutional neural network is used to train on the extracted spectrograms to obtain an instruction recognition model corresponding to that voice instruction.
  • In this way, a plurality of instruction recognition models corresponding to different voice instructions can be trained, such as an instruction recognition model corresponding to "play music", an instruction recognition model corresponding to "open WeChat", an instruction recognition model corresponding to "start the voice interaction application", and so on; a toy training sketch follows.
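  • As a toy illustration of the per-instruction idea above (a spectrogram fed to a small convolutional network), the following sketch uses NumPy and PyTorch; the window size, architecture, and shapes are assumptions made here for illustration, not parameters taken from this application.

```python
# Hypothetical sketch: one tiny CNN per voice instruction, scoring spectrograms.
import numpy as np
import torch
import torch.nn as nn

def spectrogram(wave: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram via a plain Hann-windowed STFT, shape (freq, time)."""
    window = np.hanning(n_fft)
    frames = [wave[i:i + n_fft] * window for i in range(0, len(wave) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

class InstructionModel(nn.Module):
    """Scores whether the input spectrogram carries one specific instruction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(8 * 8 * 8, 1),  # single logit: "carries this instruction?"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 1, freq, time)
        return self.net(x)
```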
  • Correspondingly, when recognizing the voice instruction carried in the second audio data, the processor invokes each instruction recognition model to score the second audio data.
  • The score reflects the probability that the second audio data carries a certain voice instruction: the higher the score given by an instruction recognition model, the higher the probability that the second audio data carries the voice instruction corresponding to that instruction recognition model.
  • the processor may use the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
  • Optionally, to ensure recognition accuracy, the processor may also use the voice instruction corresponding to the instruction recognition model that has the highest score and reaches a preset score as the voice instruction carried in the second audio data, as sketched below.
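  • A minimal sketch of this highest-score-plus-threshold selection, with placeholder names and a made-up threshold value:

```python
# Hypothetical sketch: pick the instruction whose model scores highest,
# but only accept it if the score also reaches a preset score.
from typing import Callable, Dict, Optional

def recognize(models: Dict[str, Callable[[bytes], float]],
              second_audio: bytes,
              preset_score: float = 0.8) -> Optional[str]:
    scores = {name: model(second_audio) for name, model in models.items()}
    best = max(scores, key=scores.get)                      # highest-scoring model
    return best if scores[best] >= preset_score else None   # None -> no instruction
```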
  • In an embodiment, the verifying the first audio data through the dedicated voice recognition chip includes:
  • (1) invoking a pre-trained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data to obtain a scene classification result; and
  • (2) invoking, through the dedicated voice recognition chip, a pre-trained first-level text verification model corresponding to the scene classification result to verify whether the first audio data includes a preset wake-up word.
  • In this embodiment of the application, the description is given by taking, as an example, the case where the first-level verification performed by the dedicated voice recognition chip includes verification of the text feature.
  • a machine learning algorithm is used to pre-train a scene classification model based on sample voices of different known scenes, and the scene classification model can be used to classify the scene where the electronic device is located.
  • In addition, in the embodiments of the application, a first-level text verification model set is preset in the electronic device. The first-level text verification model set includes a plurality of first-level text verification models corresponding to the preset wake-up word that are trained in advance in different scenes, so that the dedicated voice recognition chip can load a suitable model in different scenes and thereby verify more flexibly and accurately whether the first audio data includes the preset wake-up word.
  • Correspondingly, after obtaining the scene classification result corresponding to the first audio data, the electronic device invokes, through the dedicated voice recognition chip, the first-level text verification model corresponding to the scene classification result from the first-level text verification model set, and verifies through that first-level text verification model whether the first audio data includes the preset wake-up word; if yes, the verification passes, otherwise the verification fails.
  • For example, referring to FIG. 2, the first-level text verification model set includes four first-level text verification models: first-level text verification model A suitable for audio verification in scene A, first-level text verification model B suitable for audio verification in scene B, first-level text verification model C suitable for audio verification in scene C, and first-level text verification model D suitable for audio verification in scene D.
  • If the scene classification result indicates that the scene corresponding to the first audio data is scene B, the electronic device loads first-level text verification model B from the first-level text verification model set through the dedicated voice recognition chip; if the scene classification result indicates scene C, the electronic device loads first-level text verification model C, and so on.
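  • The scene-dependent model selection described above amounts to a lookup keyed by the scene classification result; a minimal sketch under assumed names (the scene labels and callables are placeholders):

```python
# Hypothetical sketch: choose the first-level text verification model by scene.
from typing import Callable, Dict

def first_level_verify(model_set: Dict[str, Callable[[bytes], bool]],
                       classify_scene: Callable[[bytes], str],
                       first_audio: bytes) -> bool:
    scene = classify_scene(first_audio)   # e.g. "A", "B", "C", "D" as in FIG. 2
    verify = model_set[scene]             # model trained for that scene
    return verify(first_audio)            # True iff the preset wake-up word is found
```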
  • In an embodiment, the verifying the first audio data through the processor includes:
  • (1) invoking, through the processor, a pre-trained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data includes the preset wake-up word;
  • (2) when the first audio data includes the preset wake-up word, invoking a pre-trained second-level voiceprint verification model through the processor, where the second-level voiceprint verification model is obtained by training on sample speech of the preset user speaking the preset wake-up word; and
  • (3) verifying, through the second-level voiceprint verification model, whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
  • In this embodiment of the application, the description is given by taking, as an example, the case where the processor performs verification of both the text feature and the voiceprint feature.
  • Here, the processor first invokes the pre-trained second-level text verification model corresponding to the preset wake-up word, and uses the second-level text verification model to verify whether the first audio data includes the preset wake-up word.
  • Exemplarily, the second-level text verification model can be obtained by training with a scoring function, where the scoring function maps a vector to a value. Under this constraint, a person of ordinary skill in the art can select a suitable function as the scoring function according to actual needs; the embodiments of the present application do not specifically limit this.
  • When the second-level text verification model is used to verify whether the first audio data includes the preset wake-up word, a feature vector that can characterize the first audio data is first extracted and input into the second-level text verification model for scoring to obtain a corresponding score. The score is then compared with the discriminant score corresponding to the second-level text verification model, and if the score reaches the discriminant score corresponding to the second-level text verification model, it is determined that the first audio data includes the preset wake-up word.
  • When it is verified that the first audio data includes the preset wake-up word, a pre-trained second-level voiceprint verification model is further invoked through the processor; the second-level voiceprint verification model is obtained by training on sample speech of the preset user speaking the preset wake-up word. Then, the second-level voiceprint verification model is used to verify whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
  • Exemplarily, the second-level voiceprint verification model can be obtained by further training the second-level text verification model on the sample speech.
  • When the second-level voiceprint verification model is used to verify whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech, a feature vector that can characterize the first audio data is first extracted and input into the second-level voiceprint verification model for scoring to obtain a corresponding score. The score is then compared with the discriminant score corresponding to the second-level voiceprint verification model; if the score reaches the discriminant score, it is determined that the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech, and the verification is judged to have passed; otherwise, the verification is judged to have failed.
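  • The two score-versus-discriminant-score comparisons above can be summarized in a short sketch; the feature extractor, models, and thresholds below are placeholders, not components disclosed by this application:

```python
# Hypothetical sketch of the processor-side verification: text check, then voiceprint check.
import numpy as np

def verify_on_processor(first_audio: np.ndarray, extract_features,
                        text_model, voiceprint_model,
                        text_discriminant: float, voiceprint_discriminant: float) -> bool:
    features = extract_features(first_audio)          # vector characterizing the audio
    if text_model(features) < text_discriminant:      # wake-up word not detected
        return False
    # Wake-up word present: check that the speaker matches the enrolled sample speech.
    return voiceprint_model(features) >= voiceprint_discriminant
```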
  • In an embodiment, the instruction execution method provided in this application further includes:
  • (1) obtaining, through the processor, a pre-trained general verification model corresponding to the preset wake-up word, and setting the general verification model as the second-level text verification model;
  • (2) controlling the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word; and
  • (3) performing, through the processor, adaptive training on the general verification model by using the sample speech to obtain the second-level voiceprint verification model.
  • In this embodiment of the application, the pre-trained general verification model corresponding to the preset wake-up word may be obtained through the processor, and the general verification model is set as the second-level text verification model.
  • In addition, the processor controls the microphone to collect sample speech of the preset user speaking the preset wake-up word. Then, the processor extracts the acoustic features of the sample speech, performs adaptive training on the general verification model with these acoustic features, and sets the adaptively trained general verification model as the second-level voiceprint verification model.
  • The adaptive training can be implemented with a maximum a posteriori (MAP) estimation algorithm.
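  • For concreteness, one common form of maximum a posteriori adaptation (the relevance-factor mean update used in GMM-UBM speaker verification) is shown below; the application does not specify its exact update rule, so the notation here is introduced for illustration only:

```latex
% One standard MAP mean update; notation introduced here, not taken from the application.
% x_t: enrollment acoustic features (e.g. MFCC frames of the sample speech)
% \gamma_k(t): posterior of Gaussian component k for frame t under the general model
% \mu_k: mean of component k in the general model; \tau: relevance factor
\hat{\mu}_k = \frac{\tau\,\mu_k + \sum_{t=1}^{T} \gamma_k(t)\, x_t}{\tau + \sum_{t=1}^{T} \gamma_k(t)}
```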
  • In an embodiment, after waking up the processor, the dedicated voice recognition chip can be controlled to sleep to save power.
  • In an embodiment, after the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, the method further includes:
  • It should be noted that the instruction recognition model in this application has a weaker recognition capability than a voice interaction application and is suitable for executing shortcut operations. Therefore, recognition by the instruction recognition model may fail (either because the instruction recognition model does not recognize anything, or because there is no voice instruction in the second audio data).
  • In this case, the processor starts, in the background, a voice interaction application with a stronger recognition capability, and recognizes the voice instruction carried in the second audio data through the voice interaction application; if a voice instruction is recognized, the recognized voice instruction is executed.
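  • A minimal sketch of this fallback path, with hypothetical names (start_voice_app and its recognize method are placeholders, not a documented API):

```python
# Hypothetical sketch: lightweight instruction models first, heavier app on a miss.
from typing import Optional

def handle_second_audio(second_audio: bytes, recognize, start_voice_app, execute) -> None:
    instruction: Optional[str] = recognize(second_audio)   # fast instruction models
    if instruction is None:
        app = start_voice_app(background=True)             # stronger recognizer, in background
        instruction = app.recognize(second_audio)
    if instruction is not None:
        execute(instruction)
```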
  • FIG. 3 is a schematic diagram of another flow of an instruction execution method provided by an embodiment of the application.
  • the instruction execution method is applied to the electronic device provided in this application, and the electronic device includes a processor, a dedicated voice recognition chip, and a microphone.
  • the instruction execution method provided in this embodiment of the application may have the following process:
  • In 201, when the processor is in a sleep state, the microphone is controlled through the dedicated voice recognition chip to perform audio collection to obtain first audio data.
  • It should be noted that the dedicated voice recognition chip in the embodiments of the application is a dedicated chip designed for voice recognition, such as a digital signal processing chip designed for voice or an application-specific integrated circuit chip designed for voice, which has lower power consumption than a general-purpose processor.
  • the dedicated voice recognition chip and the processor establish a communication connection through a communication bus (such as an I2C bus) to realize data interaction.
  • the processor sleeps when the screen of the electronic device is in the off-screen state
  • the dedicated voice recognition chip sleeps when the screen is in the on-screen state.
  • the microphone included in the electronic device may be a built-in microphone or an external microphone (which may be a wired microphone or a wireless microphone).
  • In the embodiments of the application, when the processor is in the sleep state (and the dedicated voice recognition chip is in the awake state), the dedicated voice recognition chip controls the microphone to collect external sound, and the collected audio data is recorded as the first audio data.
  • In 202, a pre-trained scene classification model is invoked through the dedicated voice recognition chip to classify the scene of the first audio data to obtain a scene classification result.
  • In 203, a pre-trained first-level text verification model corresponding to the scene classification result is invoked through the dedicated voice recognition chip to verify whether the first audio data includes the preset wake-up word; if the verification passes, the processor is woken up.
  • In this embodiment of the application, the description is given by taking, as an example, the case where the first-level verification performed by the dedicated voice recognition chip includes verification of the text feature.
  • a machine learning algorithm is used to pre-train a scene classification model based on sample voices of different known scenes, and the scene classification model can be used to classify the scene where the electronic device is located.
  • In addition, in the embodiments of the application, a first-level text verification model set is preset in the electronic device. The first-level text verification model set includes a plurality of first-level text verification models corresponding to the preset wake-up word that are trained in advance in different scenes, so that the dedicated voice recognition chip can load a suitable model in different scenes and thereby verify more flexibly and accurately whether the first audio data includes the preset wake-up word.
  • Correspondingly, after obtaining the scene classification result corresponding to the first audio data, the electronic device invokes, through the dedicated voice recognition chip, the first-level text verification model corresponding to the scene classification result from the first-level text verification model set, and verifies through that first-level text verification model whether the first audio data includes the preset wake-up word; if yes, the verification passes, otherwise the verification fails.
  • For example, referring to FIG. 2, the first-level text verification model set includes four first-level text verification models: first-level text verification model A suitable for audio verification in scene A, first-level text verification model B suitable for audio verification in scene B, first-level text verification model C suitable for audio verification in scene C, and first-level text verification model D suitable for audio verification in scene D.
  • If the scene classification result indicates that the scene corresponding to the first audio data is scene B, the electronic device loads first-level text verification model B from the first-level text verification model set through the dedicated voice recognition chip; if the scene classification result indicates scene C, the electronic device loads first-level text verification model C, and so on.
  • In the embodiments of the application, when the verification of the first audio data passes, the dedicated voice recognition chip sends a preset interrupt signal to the processor through the communication connection between them, so as to wake up the processor.
  • It should be noted that if the first audio data fails the verification, the dedicated voice recognition chip continues to control the microphone to perform audio collection until the first audio data passes the verification.
  • In 204, a pre-trained second-level text verification model corresponding to the preset wake-up word is invoked through the processor to verify whether the first audio data includes the preset wake-up word.
  • After waking up the processor, the dedicated voice recognition chip also provides the first audio data to the processor, and the processor verifies the first audio data again. Taking the Android system as an example, the dedicated voice recognition chip can provide the first audio data to the processor through the SoundTrigger framework.
  • In this embodiment of the application, the description is given by taking, as an example, the case where the processor performs verification of both the text feature and the voiceprint feature.
  • Here, the processor first invokes the pre-trained second-level text verification model corresponding to the preset wake-up word, and uses the second-level text verification model to verify whether the first audio data includes the preset wake-up word.
  • Exemplarily, the second-level text verification model can be obtained by training with a scoring function, where the scoring function maps a vector to a value. Under this constraint, a person of ordinary skill in the art can select a suitable function as the scoring function according to actual needs; the embodiments of the present application do not specifically limit this.
  • When the second-level text verification model is used to verify whether the first audio data includes the preset wake-up word, a feature vector that can characterize the first audio data is first extracted and input into the second-level text verification model for scoring to obtain a corresponding score. The score is then compared with the discriminant score corresponding to the second-level text verification model, and if the score reaches the discriminant score corresponding to the second-level text verification model, it is determined that the first audio data includes the preset wake-up word.
  • In 205, when the first audio data includes the preset wake-up word, a pre-trained second-level voiceprint verification model is invoked through the processor, where the second-level voiceprint verification model is obtained by training on sample speech of the preset user speaking the preset wake-up word.
  • In 206, the second-level voiceprint verification model is used to verify whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech; if the verification passes, the microphone is controlled through the processor to perform audio collection to obtain second audio data.
  • When it is verified that the first audio data includes the preset wake-up word, the pre-trained second-level voiceprint verification model is further invoked through the processor; the second-level voiceprint verification model is obtained by training on sample speech of the preset user speaking the preset wake-up word. Then, the second-level voiceprint verification model is used to verify whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
  • Exemplarily, the second-level voiceprint verification model can be obtained by further training the second-level text verification model on the sample speech.
  • When the second-level voiceprint verification model is used to verify whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech, a feature vector that can characterize the first audio data is first extracted and input into the second-level voiceprint verification model for scoring to obtain a corresponding score. The score is then compared with the discriminant score corresponding to the second-level voiceprint verification model; if the score reaches the discriminant score, it is determined that the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech, and the verification is judged to have passed; otherwise, the verification is judged to have failed.
  • When the verification of the first audio data by the processor passes, the processor controls the microphone to perform audio collection, and records the collected audio data as the second audio data.
  • In addition, if the verification passes, the processor also switches the screen to a bright-screen state.
  • Then, a pre-trained instruction recognition model is invoked through the processor to recognize the voice instruction carried in the second audio data, and the voice instruction is executed.
  • It should be noted that in the embodiments of the application, a machine learning algorithm is also used to train an instruction recognition model in advance, and the instruction recognition model is configured to recognize the voice instruction carried in input audio data.
  • Correspondingly, after collecting the second audio data, the processor invokes the pre-trained instruction recognition model, inputs the second audio data into the instruction recognition model for recognition to obtain the voice instruction carried in the second audio data, and executes the voice instruction.
  • For example, when the voice instruction carried in the second audio data is recognized as "start the voice interaction application", the processor starts the voice interaction application to perform more complex voice interaction with the user through the voice interaction application.
  • For another example, when the voice instruction carried in the second audio data is recognized as "play music", the processor starts a default music player for the user to play the desired music.
  • FIG. 4 is a schematic structural diagram of an instruction execution device provided by an embodiment of the application.
  • the instruction execution device can be applied to an electronic device that includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor.
  • the instruction execution device may include an audio collection module 401, a first verification module 402, a second verification module 403, and an instruction execution module 404, where:
  • the audio collection module 401 is used to control the microphone to perform audio collection through a dedicated voice recognition chip when the processor is in the dormant state to obtain the first audio data;
  • the first verification module 402 is configured to verify the first audio data through a dedicated voice recognition chip, and if the verification passes, wake up the processor;
  • the second verification module 403 is configured to verify the first audio data by the processor, and if the verification passes, the processor controls the microphone to perform audio collection to obtain the second audio data;
  • the instruction execution module 404 is configured to use the processor to call the pre-trained instruction recognition model to recognize the voice instruction carried by the second audio data, and execute the aforementioned voice instruction.
  • the instruction recognition model includes a plurality of instruction recognition models corresponding to different voice instructions.
  • the instruction execution module 404 is configured to:
  • the processor calls each instruction recognition model to score the second audio data
  • the voice command corresponding to the command recognition model with the highest score is used as the voice command carried in the second audio data.
  • the instruction execution module 404 is configured to:
  • the voice instruction corresponding to the instruction recognition model with the highest score and reaching the preset score is used as the voice instruction carried in the second audio data.
  • When verifying the first audio data through the dedicated voice recognition chip, the first verification module 402 is configured to:
  • invoke a pre-trained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data to obtain a scene classification result; and invoke, through the dedicated voice recognition chip, a pre-trained first-level text verification model corresponding to the scene classification result to verify whether the first audio data includes a preset wake-up word.
  • When verifying the first audio data through the processor, the second verification module 403 is configured to: invoke, through the processor, a pre-trained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data includes the preset wake-up word;
  • when the first audio data includes the preset wake-up word, invoke a pre-trained second-level voiceprint verification model through the processor, where the second-level voiceprint verification model is obtained by training on sample speech of a preset user speaking the preset wake-up word; and
  • verify, through the second-level voiceprint verification model, whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
  • In an embodiment, the instruction execution device provided in the embodiments of the present application further includes a model acquisition module, configured to: obtain, through the processor, a pre-trained general verification model corresponding to the preset wake-up word, and set the general verification model as the second-level text verification model;
  • control the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word; and
  • perform, through the processor, adaptive training on the general verification model by using the sample speech to obtain the second-level voiceprint verification model.
  • In an embodiment, the first verification module 402 is further configured to control the dedicated voice recognition chip to sleep after waking up the processor.
  • In an embodiment, the instruction execution module 404 is further configured to: start a voice interaction application in the background through the processor when no voice instruction carried in the second audio data is recognized; and
  • recognize the voice instruction carried in the second audio data through the voice interaction application, where the recognition capability of the voice interaction application is greater than the recognition capability of the instruction recognition model.
  • In an embodiment, the instruction execution device provided by the embodiments of the present application further includes a state switching module, configured to switch the screen of the electronic device to a bright-screen state through the processor.
  • The embodiments of the present application provide a storage medium on which an instruction execution program is stored.
  • When the stored instruction execution program is executed on the electronic device provided in the embodiments of the present application, the electronic device is caused to execute the steps in the instruction execution method provided in the embodiments of the present application.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
  • An embodiment of the present application also provides an electronic device. Please refer to FIG. 5.
  • The electronic device includes a processor 501, a dedicated voice recognition chip 502, a microphone 503, and a memory 504, and the power consumption of the dedicated voice recognition chip 502 is less than that of the processor 501.
  • The dedicated voice recognition chip 502, the processor 501, and the microphone 503 establish communication connections through a communication bus (such as an I2C bus) to exchange data.
  • It should be noted that the dedicated voice recognition chip 502 in the embodiments of the application is a dedicated chip designed for voice recognition, such as a digital signal processing chip designed for voice or an application-specific integrated circuit chip designed for voice, which has lower power consumption than a general-purpose processor.
  • the processor in the embodiment of the present application is a general-purpose processor, such as an ARM architecture processor.
  • An instruction execution program is stored in the memory 504, which may be a high-speed random access memory or a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • The memory 504 may further include a memory controller to provide the processor 501 and the dedicated voice recognition chip 502 with access to the memory 504. The instruction execution program stored in the memory 504 implements the following functions:
  • the dedicated voice recognition chip 502 is configured to control the microphone to perform audio collection when the processor 501 is in a sleep state to obtain first audio data; and to verify the first audio data, and if the verification passes, wake up the processor 501;
  • the processor 501 is configured to verify the first audio data, and when the verification passes, control the microphone to perform audio collection to obtain the second audio data;
  • the pre-trained command recognition model is called to recognize the voice command carried by the second audio data, and execute the aforementioned voice command.
  • the instruction recognition model includes a plurality of instruction recognition models corresponding to different voice instructions.
  • when invoking the pre-trained instruction recognition model to recognize the voice instruction carried in the second audio data, the processor 501 is configured to:
  • invoke each instruction recognition model to score the second audio data; and use the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
  • the processor 501 when the voice instruction corresponding to the instruction recognition model with the highest score is used as the voice instruction carried in the second audio data, the processor 501 is configured to:
  • the voice instruction corresponding to the instruction recognition model with the highest score and reaching the preset score is used as the voice instruction carried in the second audio data.
  • When verifying the first audio data, the dedicated voice recognition chip 502 is configured to:
  • invoke a pre-trained scene classification model to classify the scene of the first audio data to obtain a scene classification result; and invoke a pre-trained first-level text verification model corresponding to the scene classification result to verify whether the first audio data includes a preset wake-up word.
  • When verifying the first audio data, the processor 501 is configured to:
  • invoke a pre-trained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data includes the preset wake-up word;
  • when the first audio data includes the preset wake-up word, invoke a pre-trained second-level voiceprint verification model, where the second-level voiceprint verification model is obtained by training on sample speech of a preset user speaking the preset wake-up word; and
  • use the second-level voiceprint verification model to verify whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
  • In an embodiment, the processor 501 is further configured to:
  • obtain a pre-trained general verification model corresponding to the preset wake-up word, and set the general verification model as the second-level text verification model; control the microphone to collect sample speech of the preset user speaking the preset wake-up word; and perform adaptive training on the general verification model by using the sample speech to obtain the second-level voiceprint verification model.
  • the dedicated voice recognition chip 502 also sleeps after waking up the processor 501.
  • In an embodiment, the processor 501 is further configured to:
  • start a voice interaction application in the background when no voice instruction carried in the second audio data is recognized; and recognize the voice instruction carried in the second audio data through the voice interaction application, where the recognition capability of the voice interaction application is greater than the recognition capability of the instruction recognition model.
  • the processor 501 is further configured to switch the screen of the electronic device to a bright screen state.
  • It should be noted that the electronic device provided in the embodiments of this application belongs to the same concept as the instruction execution method in the above embodiments. Any method provided in the instruction execution method embodiments can be run on the electronic device; for the specific implementation process, refer to the embodiments of the instruction execution method, which will not be repeated here.
  • It can be understood by a person of ordinary skill in the art that all or part of the flow of the instruction execution method of the embodiments of this application can be completed by a computer program controlling related hardware. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and be executed by the processor and the dedicated voice recognition chip in the electronic device, and the execution process may include the flow of an embodiment of the instruction execution method.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An instruction execution method and apparatus, a storage medium, and an electronic device. The method includes: when a processor is in a sleep state, controlling, by a dedicated voice recognition chip, a microphone to collect audio to obtain first audio data (101); verifying the first audio data by the dedicated voice recognition chip, and if the verification passes, waking up the processor (102); verifying the first audio data by the processor, and if the verification passes, controlling the microphone to collect audio to obtain second audio data (103); and invoking, by the processor, a pre-trained instruction recognition model to recognize the voice instruction carried in the second audio data, and executing the voice instruction (104).

Description

Instruction execution method and apparatus, storage medium, and electronic device
This application claims priority to Chinese Patent Application No. 202010125950.2, filed with the Chinese Patent Office on February 27, 2020 and entitled "Instruction Execution Method and Apparatus, Storage Medium, and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of speech processing, and in particular to an instruction execution method and apparatus, a storage medium, and an electronic device.
Background
At present, a user can speak a wake-up word to wake up an electronic device, and speak a voice instruction to control the electronic device to perform a specific operation, in situations where it is inconvenient to operate the electronic device directly.
Summary
Embodiments of this application provide an instruction execution method and apparatus, a storage medium, and an electronic device, which can improve the ease of use of voice control while reducing the power consumption of the electronic device for voice wake-up.
In a first aspect, an embodiment of this application provides an instruction execution method applied to an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor. The instruction execution method includes:
when the processor is in a sleep state, controlling the microphone through the dedicated voice recognition chip to perform audio collection to obtain first audio data;
verifying the first audio data through the dedicated voice recognition chip, and if the verification passes, waking up the processor;
verifying the first audio data through the processor, and if the verification passes, controlling the microphone through the processor to perform audio collection to obtain second audio data; and
invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, and executing the voice instruction.
In a second aspect, an embodiment of this application provides an instruction execution apparatus applied to an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor. The instruction execution apparatus includes:
an audio collection module, configured to control the microphone through the dedicated voice recognition chip to perform audio collection when the processor is in a sleep state, to obtain first audio data;
a first verification module, configured to verify the first audio data through the dedicated voice recognition chip, and if the verification passes, wake up the processor;
a second verification module, configured to verify the first audio data through the processor, and if the verification passes, control the microphone through the processor to perform audio collection to obtain second audio data; and
an instruction execution module, configured to invoke a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, and execute the voice instruction.
In a third aspect, an embodiment of this application provides a storage medium on which a computer program is stored. When the computer program runs on an electronic device including a processor, a dedicated voice recognition chip, and a microphone, the electronic device is caused to execute the steps in the instruction execution method provided by the embodiments of this application, where the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor.
In a fourth aspect, an embodiment of this application further provides an electronic device. The electronic device includes an audio collection unit, a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor, where:
the dedicated voice recognition chip is configured to, when the processor is in a sleep state, control the microphone to collect external first audio data; and
verify the first audio data, and if the verification passes, wake up the processor; and
the processor is configured to verify the first audio data, and if the verification passes, control the microphone to collect external second audio data; and
invoke a pre-trained instruction recognition model to recognize the voice instruction carried in the second audio data, and execute the voice instruction.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Obviously, the accompanying drawings in the following description show merely some embodiments of this application, and a person skilled in the art may derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a schematic flowchart of an instruction execution method provided by an embodiment of this application.
FIG. 2 is a schematic diagram of invoking a first-level text verification model in an embodiment of this application.
FIG. 3 is another schematic flowchart of the instruction execution method provided by an embodiment of this application.
FIG. 4 is a schematic structural diagram of an instruction execution apparatus provided by an embodiment of this application.
FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed Description
Referring to the drawings, in which like reference numerals represent like components, the principles of this application are illustrated by implementation in a suitable computing environment. The following description is based on the illustrated specific embodiments of this application, which should not be construed as limiting other specific embodiments of this application not detailed herein.
An embodiment of this application first provides an instruction execution method. The execution subject of the instruction execution method may be the electronic device provided by the embodiments of this application. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor. The electronic device may be a device equipped with a processor and having processing capability, such as a smartphone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
This application provides an instruction execution method applied to an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor. The instruction execution method includes:
when the processor is in a sleep state, controlling the microphone through the dedicated voice recognition chip to perform audio collection to obtain first audio data;
verifying the first audio data through the dedicated voice recognition chip, and if the verification passes, waking up the processor;
verifying the first audio data through the processor, and if the verification passes, controlling the microphone through the processor to perform audio collection to obtain second audio data; and
invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, and executing the voice instruction.
Optionally, in an embodiment, the instruction recognition model includes a plurality of instruction recognition models corresponding to different voice instructions, and the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data includes:
invoking each instruction recognition model through the processor to score the second audio data; and
using the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
Optionally, in an embodiment, the using the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data includes:
using the voice instruction corresponding to the instruction recognition model that has the highest score and reaches a preset score as the voice instruction carried in the second audio data.
Optionally, in an embodiment, the verifying the first audio data through the dedicated voice recognition chip includes:
invoking a pre-trained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data to obtain a scene classification result; and
invoking, through the dedicated voice recognition chip, a pre-trained first-level text verification model corresponding to the scene classification result to verify whether the first audio data includes a preset wake-up word.
Optionally, in an embodiment, the verifying the first audio data through the processor includes:
invoking, through the processor, a pre-trained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data includes the preset wake-up word;
when the first audio data includes the preset wake-up word, invoking a pre-trained second-level voiceprint verification model through the processor, where the second-level voiceprint verification model is obtained by training on sample speech of a preset user speaking the preset wake-up word; and
verifying, through the second-level voiceprint verification model, whether the voiceprint feature of the first audio data matches the voiceprint feature of the sample speech.
Optionally, in an embodiment, the method further includes:
obtaining, through the processor, a pre-trained general verification model corresponding to the preset wake-up word, and setting the general verification model as the second-level text verification model;
controlling the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word; and
performing, through the processor, adaptive training on the general verification model by using the sample speech to obtain the second-level voiceprint verification model.
Optionally, in an embodiment, after the waking up the processor, the method further includes:
controlling the dedicated voice recognition chip to sleep.
Optionally, in an embodiment, after the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, the method further includes:
when no voice instruction carried in the second audio data is recognized, starting a voice interaction application in the background through the processor; and
recognizing the voice instruction carried in the second audio data through the voice interaction application, where the recognition capability of the voice interaction application is greater than the recognition capability of the instruction recognition model.
Optionally, in an embodiment, the method further includes:
switching the screen of the electronic device to a bright-screen state through the processor.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of the instruction execution method provided by an embodiment of this application. The instruction execution method is applied to the electronic device provided by this application, which includes a processor, a dedicated voice recognition chip, and a microphone. As shown in FIG. 1, the flow of the instruction execution method provided by the embodiment of this application may be as follows:
In 101, when the processor is in a sleep state, the microphone is controlled through the dedicated voice recognition chip to perform audio collection to obtain first audio data.
It should be noted that the dedicated voice recognition chip in the embodiments of this application is a dedicated chip designed for voice recognition, such as a digital signal processing chip designed for voice or an application-specific integrated circuit chip designed for voice, which has lower power consumption than a general-purpose processor. The dedicated voice recognition chip and the processor establish a communication connection through a communication bus (such as an I2C bus) to exchange data. The processor sleeps when the screen of the electronic device is in the off-screen state, while the dedicated voice recognition chip sleeps when the screen is in the bright-screen state. In addition, the microphone included in the electronic device may be a built-in microphone or an external microphone (which may be wired or wireless).
In the embodiments of this application, when the processor is in the sleep state (and the dedicated voice recognition chip is in the awake state), the dedicated voice recognition chip controls the microphone to collect external sound, and the collected audio data is recorded as the first audio data.
In 102, the first audio data is verified through the dedicated voice recognition chip, and if the verification passes, the processor is woken up.
The verification of the first audio data by the dedicated voice recognition chip includes, but is not limited to, verification of the text feature and/or the voiceprint feature of the first audio data.
In plain terms, verifying the text feature of the first audio data means verifying whether the first audio data includes a preset wake-up word; as long as the first audio data includes the preset wake-up word, the text-feature verification passes, regardless of who speaks the preset wake-up word. For example, if the first audio data includes a preset wake-up word set by a preset user (for example, the owner of the electronic device, or another user authorized by the owner to use the electronic device), but the preset wake-up word is spoken by user A rather than the preset user, the dedicated voice recognition chip will still pass the verification when verifying the text feature of the first audio data through the first verification algorithm.
Verifying the text feature and the voiceprint feature of the first audio data means verifying whether the first audio data includes the preset wake-up word spoken by the preset user. If the first audio data includes the preset wake-up word spoken by the preset user, the text-feature and voiceprint-feature verification passes; otherwise, it fails. For example, if the first audio data includes the preset wake-up word set by the preset user and the preset wake-up word is spoken by the preset user, the text-feature and voiceprint-feature verification of the first audio data passes; if the first audio data includes the preset wake-up word spoken by a user other than the preset user, or does not include the preset wake-up word spoken by any user, the text-feature and voiceprint-feature verification fails.
In the embodiments of this application, when the verification of the first audio data passes, the dedicated voice recognition chip sends a preset interrupt signal to the processor through the communication connection between them, so as to wake up the processor.
It should be noted that if the first audio data fails the verification, the dedicated voice recognition chip continues to control the microphone to perform audio collection until the first audio data passes the verification.
In 103, the first audio data is verified through the processor, and if the verification passes, the microphone is controlled through the processor to perform audio collection to obtain second audio data.
After waking up the processor, the dedicated voice recognition chip also provides the first audio data to the processor, and the processor verifies the first audio data again. Taking the Android system as an example, the dedicated voice recognition chip can provide the first audio data to the processor through the SoundTrigger framework.
It should be noted that the verification of the first audio data by the processor includes, but is not limited to, verification of the text feature and/or the voiceprint feature of the first audio data.
When the verification of the first audio data by the processor passes, the processor controls the microphone to perform audio collection, and records the collected audio data as the second audio data. In addition, if the verification passes, the processor also switches the screen to a bright-screen state.
In 104, a pre-trained instruction recognition model is invoked through the processor to recognize the voice instruction carried in the second audio data, and the voice instruction is executed.
It should be noted that in the embodiments of this application, a machine learning algorithm is also used to train an instruction recognition model in advance, and the instruction recognition model is configured to recognize the voice instruction carried in input audio data.
Correspondingly, after collecting the second audio data, the processor invokes the pre-trained instruction recognition model, inputs the second audio data into the instruction recognition model for recognition to obtain the voice instruction carried in the second audio data, and executes the voice instruction.
For example, when the voice instruction carried in the second audio data is recognized as "start the voice interaction application", the processor starts the voice interaction application to perform more complex voice interaction with the user through the voice interaction application. For another example, when the voice instruction carried in the second audio data is recognized as "play music", the processor starts a default music player for the user to play the desired music.
As can be seen from the above, when the processor is in the sleep state, the dedicated voice recognition chip, whose power consumption is lower than that of the processor, controls the microphone to perform audio collection to obtain first audio data; the first audio data is then verified through the dedicated voice recognition chip, and if the verification passes, the processor is woken up; the first audio data is verified again through the processor, and if the verification passes, the microphone is controlled through the processor to perform audio collection to obtain second audio data; finally, a pre-trained instruction recognition model is invoked through the processor to recognize the voice instruction carried in the second audio data, and the voice instruction is executed. In this way, the power consumption of the electronic device for voice wake-up can be reduced; at the same time, since there is no need to start a voice interaction application to recognize voice instructions, the ease of use of voice control is also improved.
In an embodiment, the instruction recognition model includes a plurality of instruction recognition models corresponding to different voice instructions, and the invoking a pre-trained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data includes:
(1) invoking each instruction recognition model through the processor to score the second audio data; and
(2) using the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
It should be noted that in the embodiments of this application, a plurality of instruction recognition models corresponding to different voice instructions are trained in advance.
Exemplarily, voice instructions include "play music", "open WeChat", "start the voice interaction application", and so on. For each voice instruction, sample speech including the voice instruction is collected, and its spectrogram is extracted. Then, a convolutional neural network is used to train on the extracted spectrograms to obtain an instruction recognition model corresponding to that voice instruction. In this way, a plurality of instruction recognition models corresponding to different voice instructions can be trained, such as an instruction recognition model corresponding to "play music", an instruction recognition model corresponding to "open WeChat", an instruction recognition model corresponding to "start the voice interaction application", and so on.
Correspondingly, when recognizing the voice instruction carried in the second audio data, the processor invokes each instruction recognition model to score the second audio data. The score reflects the probability that the second audio data carries a certain voice instruction: the higher the score given by an instruction recognition model, the higher the probability that the second audio data carries the voice instruction corresponding to that instruction recognition model.
Correspondingly, the processor may use the voice instruction corresponding to the instruction recognition model with the highest score as the voice instruction carried in the second audio data.
Optionally, to ensure recognition accuracy, the processor may also use the voice instruction corresponding to the instruction recognition model that has the highest score and reaches a preset score as the voice instruction carried in the second audio data.
In an embodiment, the verifying the first audio data through the dedicated voice recognition chip includes:
(1) invoking a pre-trained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data to obtain a scene classification result; and
(2) invoking, through the dedicated voice recognition chip, a pre-trained first-level text verification model corresponding to the scene classification result to verify whether the first audio data includes the preset wake-up word.
In this embodiment of this application, the description is given by taking, as an example, the case where the first-level verification performed by the dedicated voice recognition chip includes verification of the text feature.
It should be noted that in the embodiments of this application, a scene classification model is also pre-trained with a machine learning algorithm based on sample speech from different known scenes, and the scene classification model can be used to classify the scene where the electronic device is located.
In addition, in the embodiments of this application, a first-level text verification model set is preset in the electronic device. The first-level text verification model set includes a plurality of first-level text verification models corresponding to the preset wake-up word that are trained in advance in different scenes, so that the dedicated voice recognition chip can load a suitable model in different scenes and thereby verify more flexibly and accurately whether the first audio data includes the preset wake-up word.
Correspondingly, after obtaining the scene classification result corresponding to the first audio data, the electronic device invokes, through the dedicated voice recognition chip, the first-level text verification model corresponding to the scene classification result from the first-level text verification model set, and verifies through that first-level text verification model whether the first audio data includes the preset wake-up word; if yes, the verification passes, otherwise the verification fails.
For example, referring to FIG. 2, the first-level text verification model set includes four first-level text verification models: first-level text verification model A suitable for audio verification in scene A, first-level text verification model B suitable for audio verification in scene B, first-level text verification model C suitable for audio verification in scene C, and first-level text verification model D suitable for audio verification in scene D. If the scene classification result indicates that the scene corresponding to the first audio data is scene B, the electronic device loads first-level text verification model B from the first-level text verification model set through the dedicated voice recognition chip; if the scene classification result indicates scene C, the electronic device loads first-level text verification model C, and so on.
In one embodiment, verifying the first audio data through the processor includes:
(1) calling, through the processor, a pretrained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data contains the preset wake-up word;
(2) when the first audio data contains the preset wake-up word, calling a pretrained second-level voiceprint verification model through the processor, the second-level voiceprint verification model being trained on sample speech of the preset user speaking the preset wake-up word;
(3) verifying, with the second-level voiceprint verification model, whether the voiceprint features of the first audio data match the voiceprint features of the sample speech.
The embodiments of the present application are described here taking the case where the processor verifies both text features and voiceprint features.
First, the processor calls the pretrained second-level text verification model corresponding to the preset wake-up word and uses it to verify whether the first audio data contains the preset wake-up word.
By way of example, the second-level text verification model may be trained from a scoring function, where the scoring function maps a vector to a numeric value; under this constraint, a person of ordinary skill in the art may choose a suitable function as the scoring function according to actual needs, and the embodiments of the present application impose no specific limitation on this.
When verifying with the second-level text verification model whether the first audio data contains the preset wake-up word, a feature vector characterizing the first audio data is first extracted and fed into the second-level text verification model for scoring, yielding a score. The score is then compared with the decision score associated with the second-level text verification model; if the score reaches that decision score, the first audio data is judged to contain the preset wake-up word.
When the first audio data is verified to contain the preset wake-up word, the processor further calls the pretrained second-level voiceprint verification model, which is trained on sample speech of the preset user speaking the preset wake-up word, and uses it to verify whether the voiceprint features of the first audio data match those of the sample speech.
By way of example, the second-level voiceprint verification model may be obtained by further training the second-level text verification model on the sample speech. When verifying whether the voiceprint features of the first audio data match those of the sample speech, a feature vector characterizing the first audio data is first extracted and fed into the second-level voiceprint verification model for scoring, yielding a score. The score is then compared with the decision score associated with the second-level voiceprint verification model; if the score reaches that decision score, the voiceprint features of the first audio data are judged to match those of the sample speech and the verification passes; otherwise the verification fails.
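The two-stage check described above (a text score first, then a voiceprint score only if the text check passes, each compared against its own decision score) can be sketched as follows. The linear scorers, the feature extractor, and the thresholds are stand-ins for the trained models, not the actual scoring functions.

```python
import numpy as np

rng = np.random.default_rng(0)
w_text = rng.normal(size=8)    # stand-in for the second-level text model
w_voice = rng.normal(size=8)   # stand-in for the second-level voiceprint model
TEXT_THRESHOLD, VOICE_THRESHOLD = 0.0, 0.5  # assumed decision scores

def extract_features(audio: np.ndarray) -> np.ndarray:
    """Placeholder for e.g. MFCC-based feature extraction."""
    return audio[:8]

def second_level_verify(audio: np.ndarray) -> bool:
    feats = extract_features(audio)
    if np.dot(w_text, feats) < TEXT_THRESHOLD:       # wake word not found
        return False
    return np.dot(w_voice, feats) >= VOICE_THRESHOLD # voiceprint match?

print(second_level_verify(rng.normal(size=16)))
```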
In one embodiment, the instruction execution method provided by the present application further includes:
(1) obtaining, through the processor, a pretrained universal verification model corresponding to the preset wake-up word, and setting the universal verification model as the second-level text verification model;
(2) controlling the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word;
(3) adaptively training the universal verification model with the sample speech through the processor, obtaining the second-level voiceprint verification model.
For example, sample signals of many people (say, 200) speaking the preset wake-up word may be collected in advance, their acoustic features (such as Mel-frequency cepstral coefficients) extracted, and a universal verification model corresponding to the preset wake-up word trained on those features. Because the universal verification model is trained on a large amount of audio unrelated to any specific person (that is, any specific user), it only fits the distribution of human acoustic features and does not represent any particular individual.
In the embodiments of the present application, the processor may obtain the pretrained universal verification model corresponding to the preset wake-up word and set it as the second-level text verification model.
In addition, the processor controls the microphone to collect sample speech of the preset user speaking the preset wake-up word. The processor then extracts the acoustic features of the sample speech and adaptively trains the universal verification model on those features, setting the adaptively trained universal verification model as the second-level voiceprint verification model. The adaptive training may be implemented with a maximum a posteriori estimation algorithm.
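As a hedged sketch of this adaptation step, the following adapts the means of a small Gaussian-mixture universal model toward the preset user's sample speech with a maximum a posteriori (relevance-MAP) update. The mixture count, the feature dimension, the relevance factor, and the random stand-in data are illustrative assumptions; a real system would use MFCC frames extracted from the collected sample speech.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, frames, relevance=16.0):
    """One MAP update of GMM means from the preset user's sample speech."""
    # responsibilities: soft assignment of each frame to each mixture
    d2 = ((frames[:, None, :] - ubm_means[None]) ** 2).sum(-1)
    resp = ubm_weights * np.exp(-0.5 * d2)
    resp /= resp.sum(axis=1, keepdims=True)
    n_k = resp.sum(axis=0)                      # soft counts per mixture
    f_k = resp.T @ frames                       # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]  # data-vs-prior weighting
    return alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) + (1 - alpha) * ubm_means

rng = np.random.default_rng(1)
ubm_means = rng.normal(size=(4, 13))    # 4 mixtures over 13-dim MFCC features
ubm_weights = np.full(4, 0.25)
user_frames = rng.normal(loc=0.5, size=(200, 13))  # wake-word sample speech
speaker_means = map_adapt_means(ubm_means, ubm_weights, user_frames)
```

With little user data the update stays close to the universal model; with more data it moves toward the user's own statistics, which is the usual motivation for relevance-MAP adaptation.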
In one embodiment, after waking the processor, the method further includes:
controlling the dedicated voice recognition chip to sleep.
In the embodiments of the present application, after the processor is woken, the dedicated voice recognition chip may be put to sleep to save power.
In one embodiment, after the processor calls the pretrained instruction recognition model to recognize the voice instruction carried in the second audio data, the method further includes:
(1) when no voice instruction carried in the second audio data is recognized, launching a voice interaction application in the background through the processor;
(2) recognizing the voice instruction carried in the second audio data through the voice interaction application, the recognition capability of the voice interaction application being greater than that of the instruction recognition model.
It should be noted that the instruction recognition model in the present application has weaker recognition capability than the voice interaction application and is suited to executing shortcut operations. The instruction recognition model may therefore fail to recognize anything (either because it missed the instruction or because the second audio data carries no voice instruction); in that case, the processor launches the more capable voice interaction application in the background and uses it to recognize the voice instruction carried in the second audio data, executing the instruction if one is recognized.
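A minimal sketch of this fallback path follows; both recognizers are hypothetical stand-ins, with the lightweight one restricted to a small closed command set.

```python
def light_recognize(audio: str):
    """Lightweight instruction models: a small closed set of shortcuts."""
    commands = {"play music": "play_music"}
    return commands.get(audio)

def assistant_recognize(audio: str):
    """Stronger recognizer inside the voice interaction application."""
    return "play_music" if "music" in audio else None

def execute_voice_command(audio: str):
    cmd = light_recognize(audio)
    if cmd is None:  # model missed, or no command present
        # start the voice interaction app in the background, then retry
        cmd = assistant_recognize(audio)
    return cmd

print(execute_voice_command("please play music"))  # -> play_music
```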
FIG. 3 is another schematic flowchart of the instruction execution method provided by the embodiments of the present application. The instruction execution method is applied to the electronic device provided by the present application, which includes a processor, a dedicated voice recognition chip, and a microphone. As shown in FIG. 3, the flow of the instruction execution method provided by the embodiments of the present application may be as follows:
At 201, while the processor is in the sleep state, the dedicated voice recognition chip controls the microphone to capture audio, obtaining first audio data.
It should be noted that the dedicated voice recognition chip in the embodiments of the present application is a dedicated chip designed for the purpose of voice recognition, such as a digital signal processing chip designed for voice or an application-specific integrated circuit chip designed for voice; compared with a general-purpose processor, it has lower power consumption. The dedicated voice recognition chip and the processor establish a communication connection over a communication bus (for example, an I2C bus) to exchange data. The processor sleeps while the screen of the electronic device is off, whereas the dedicated voice recognition chip sleeps while the screen is on. In addition, the microphone included in the electronic device may be a built-in microphone or an external microphone (either wired or wireless).
In the embodiments of the present application, while the processor is in the sleep state (and the dedicated voice recognition chip is awake), the dedicated voice recognition chip controls the microphone to capture external sound, and the captured audio data is recorded as the first audio data.
At 202, the dedicated voice recognition chip calls a pretrained scene classification model to classify the scene of the first audio data, obtaining a scene classification result.
At 203, the dedicated voice recognition chip calls a pretrained first-level text verification model corresponding to the scene classification result to verify whether the first audio data contains the preset wake-up word; if the verification passes, the processor is woken up.
The embodiments of the present application are described here taking the case where the first-level verification performed by the dedicated voice recognition chip includes text-feature verification.
It should be noted that in the embodiments of the present application, a scene classification model is also pretrained with a machine learning algorithm on sample speech from different known scenes; with this model, the scene in which the electronic device is located can be classified.
In addition, in the embodiments of the present application, a set of first-level text verification models is preconfigured in the electronic device. The set includes multiple first-level text verification models for the preset wake-up word, each trained in advance under a different scene, so that the dedicated voice recognition chip can load the appropriate one in each scene and thereby verify more flexibly and accurately whether the first audio data contains the preset wake-up word.
Accordingly, after obtaining the scene classification result for the first audio data, the electronic device uses the dedicated voice recognition chip to call, from the set of first-level text verification models, the first-level text verification model corresponding to the scene classification result, and uses that model to verify whether the first audio data contains the preset wake-up word; if so, the verification passes, otherwise it fails.
For example, referring to FIG. 2, the set of first-level text verification models includes four models: first-level text verification model A suited to audio verification in scene A, first-level text verification model B suited to scene B, first-level text verification model C suited to scene C, and first-level text verification model D suited to scene D. If the scene classification result indicates that the first audio data corresponds to scene B, the electronic device loads first-level text verification model B from the set through the dedicated voice recognition chip; if the result indicates scene C, it loads first-level text verification model C, and so on.
In the embodiments of the present application, when the verification of the first audio data passes, the dedicated voice recognition chip sends a preset interrupt signal to the processor over the communication connection between them, thereby waking the processor.
It should be noted that if the first audio data fails the verification, the dedicated voice recognition chip continues to control the microphone to capture audio until first audio data that passes the verification is obtained.
At 204, the processor calls a pretrained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data contains the preset wake-up word.
After waking the processor, the dedicated voice recognition chip also provides the first audio data to the processor, which verifies it again. Taking the Android system as an example, the dedicated voice recognition chip may provide the first audio data to the processor through the SoundTrigger framework.
The embodiments of the present application are described here taking the case where the processor verifies both text features and voiceprint features.
First, the processor calls the pretrained second-level text verification model corresponding to the preset wake-up word and uses it to verify whether the first audio data contains the preset wake-up word.
By way of example, the second-level text verification model may be trained from a scoring function, where the scoring function maps a vector to a numeric value; under this constraint, a person of ordinary skill in the art may choose a suitable function as the scoring function according to actual needs, and the embodiments of the present application impose no specific limitation on this.
When verifying with the second-level text verification model whether the first audio data contains the preset wake-up word, a feature vector characterizing the first audio data is first extracted and fed into the second-level text verification model for scoring, yielding a score. The score is then compared with the decision score associated with the second-level text verification model; if the score reaches that decision score, the first audio data is judged to contain the preset wake-up word.
At 205, when the first audio data contains the preset wake-up word, the processor calls a pretrained second-level voiceprint verification model, the second-level voiceprint verification model being trained on sample speech of the preset user speaking the preset wake-up word.
At 206, the second-level voiceprint verification model verifies whether the voiceprint features of the first audio data match the voiceprint features of the sample speech; if the verification passes, the processor controls the microphone to capture audio, obtaining second audio data.
When the first audio data is verified to contain the preset wake-up word, the processor further calls the pretrained second-level voiceprint verification model, which is trained on sample speech of the preset user speaking the preset wake-up word, and uses it to verify whether the voiceprint features of the first audio data match those of the sample speech.
By way of example, the second-level voiceprint verification model may be obtained by further training the second-level text verification model on the sample speech. When verifying whether the voiceprint features of the first audio data match those of the sample speech, a feature vector characterizing the first audio data is first extracted and fed into the second-level voiceprint verification model for scoring, yielding a score. The score is then compared with the decision score associated with the second-level voiceprint verification model; if the score reaches that decision score, the voiceprint features of the first audio data are judged to match those of the sample speech and the verification passes; otherwise the verification fails.
When its verification of the first audio data passes, the processor controls the microphone to capture audio, and the captured audio data is recorded as the second audio data.
In addition, if the processor's verification of the first audio data passes, the processor also switches the screen to the on state.
At 207, the processor calls a pretrained instruction recognition model to recognize the voice instruction carried in the second audio data, and executes the voice instruction.
It should be noted that in the embodiments of the present application, an instruction recognition model is trained in advance using a machine learning algorithm; this model is configured to recognize the voice instruction carried in input audio data.
Accordingly, after capturing the second audio data, the processor calls the pretrained instruction recognition model, feeds the second audio data into it for recognition, obtains the voice instruction carried in the second audio data, and executes that instruction.
For example, when the recognized voice instruction carried in the second audio data is "launch the voice interaction application", the processor launches the voice interaction application, enabling more complex voice interaction with the user through it.
As another example, when the recognized voice instruction carried in the second audio data is "play music", the processor launches the default music player so the user can play the desired music.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of the instruction execution apparatus provided by the embodiments of the present application. The instruction execution apparatus may be applied to an electronic device that includes a processor, a dedicated voice recognition chip, and a microphone, the power consumption of the dedicated voice recognition chip being less than that of the processor. The instruction execution apparatus may include an audio collection module 401, a first verification module 402, a second verification module 403, and an instruction execution module 404, wherein:
the audio collection module 401 is configured to, while the processor is in the sleep state, control the microphone through the dedicated voice recognition chip to capture audio, obtaining first audio data;
the first verification module 402 is configured to verify the first audio data through the dedicated voice recognition chip and, if the verification passes, wake the processor;
the second verification module 403 is configured to verify the first audio data through the processor and, if the verification passes, control the microphone through the processor to capture audio, obtaining second audio data;
the instruction execution module 404 is configured to call, through the processor, a pretrained instruction recognition model to recognize the voice instruction carried in the second audio data, and execute the voice instruction.
In one embodiment, the instruction recognition model includes multiple instruction recognition models corresponding to different voice instructions, and when calling the pretrained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, the instruction execution module 404 is configured to:
score the second audio data with each instruction recognition model through the processor;
take the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data.
In one embodiment, when taking the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data, the instruction execution module 404 is configured to:
take the voice instruction corresponding to the instruction recognition model whose score is the highest and reaches a preset score as the voice instruction carried in the second audio data.
In one embodiment, when verifying the first audio data through the dedicated voice recognition chip, the first verification module 402 is configured to:
call a pretrained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data, obtaining a scene classification result;
call, through the dedicated voice recognition chip, a pretrained first-level text verification model corresponding to the scene classification result to verify whether the first audio data contains the preset wake-up word.
In one embodiment, when verifying the first audio data through the processor, the second verification module 403 is configured to:
call, through the processor, a pretrained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data contains the preset wake-up word;
when the first audio data contains the preset wake-up word, call a pretrained second-level voiceprint verification model through the processor, the second-level voiceprint verification model being trained on sample speech of the preset user speaking the preset wake-up word;
verify, with the second-level voiceprint verification model, whether the voiceprint features of the first audio data match the voiceprint features of the sample speech.
In one embodiment, the instruction execution apparatus provided by the embodiments of the present application further includes a model acquisition module configured to:
obtain, through the processor, a pretrained universal verification model corresponding to the preset wake-up word, and set the universal verification model as the second-level text verification model;
control the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word;
adaptively train the universal verification model with the sample speech through the processor, obtaining the second-level voiceprint verification model.
In one embodiment, after the processor is woken, the first verification module 402 is further configured to:
control the dedicated voice recognition chip to sleep.
In one embodiment, after calling the pretrained instruction recognition model through the processor to recognize the voice instruction carried in the second audio data, the instruction execution module 404 is further configured to:
when no voice instruction carried in the second audio data is recognized, launch a voice interaction application in the background through the processor;
recognize the voice instruction carried in the second audio data through the voice interaction application, the recognition capability of the voice interaction application being greater than that of the instruction recognition model.
In one embodiment, the instruction execution apparatus provided by the embodiments of the present application further includes a state switching module configured to switch the screen of the electronic device to the on state.
The embodiments of the present application provide a storage medium on which an instruction execution program is stored; when the stored instruction execution program runs on the electronic device provided by the embodiments of the present application, the electronic device is caused to perform the steps of the instruction execution method provided by the embodiments of the present application. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The embodiments of the present application further provide an electronic device. Referring to FIG. 5, the electronic device includes a processor 501, a dedicated voice recognition chip 502, a microphone 503, and a memory 504, the power consumption of the dedicated voice recognition chip 502 being less than that of the processor 501. Any two of the dedicated voice recognition chip 502, the processor 501, and the microphone 503 establish a communication connection over a communication bus (for example, an I2C bus) to exchange data.
It should be noted that the dedicated voice recognition chip 502 in the embodiments of the present application is a dedicated chip designed for the purpose of voice recognition, such as a digital signal processing chip designed for voice or an application-specific integrated circuit chip designed for voice; compared with a general-purpose processor, it has lower power consumption.
The processor in the embodiments of the present application is a general-purpose processor, such as an ARM-architecture processor.
The memory 504 stores an instruction execution program and may be a high-speed random access memory or a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 504 may further include a memory controller to provide the processor 501 and the dedicated voice recognition chip 502 with access to the memory 504, implementing the following functions:
the dedicated voice recognition chip 502 is configured to, while the processor 501 is in the sleep state, control the microphone to capture audio, obtaining first audio data; and
to verify the first audio data and wake the processor 501 when the verification passes;
the processor 501 is configured to verify the first audio data and, when the verification passes, control the microphone to capture audio, obtaining second audio data; and
to call a pretrained instruction recognition model to recognize the voice instruction carried in the second audio data, and execute the voice instruction.
In one embodiment, the instruction recognition model includes multiple instruction recognition models corresponding to different voice instructions, and when calling the pretrained instruction recognition model to recognize the voice instruction carried in the second audio data, the processor 501 is configured to:
score the second audio data with each instruction recognition model;
take the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data.
In one embodiment, when taking the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data, the processor 501 is configured to:
take the voice instruction corresponding to the instruction recognition model whose score is the highest and reaches a preset score as the voice instruction carried in the second audio data.
In one embodiment, when verifying the first audio data, the dedicated voice recognition chip 502 is configured to:
call a pretrained scene classification model to classify the scene of the first audio data, obtaining a scene classification result;
call a pretrained first-level text verification model corresponding to the scene classification result to verify whether the first audio data contains the preset wake-up word.
In one embodiment, when verifying the first audio data, the processor 501 is configured to:
call a pretrained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data contains the preset wake-up word;
when the first audio data contains the preset wake-up word, call a pretrained second-level voiceprint verification model, the second-level voiceprint verification model being trained on sample speech of the preset user speaking the preset wake-up word;
verify, with the second-level voiceprint verification model, whether the voiceprint features of the first audio data match the voiceprint features of the sample speech.
In one embodiment, the processor 501 is further configured to:
obtain a pretrained universal verification model corresponding to the preset wake-up word, and set the universal verification model as the second-level text verification model;
control the microphone to collect sample speech of the preset user speaking the preset wake-up word;
adaptively train the universal verification model with the sample speech, obtaining the second-level voiceprint verification model.
In one embodiment, the dedicated voice recognition chip 502 further sleeps after waking the processor 501.
In one embodiment, after calling the pretrained instruction recognition model to recognize the voice instruction carried in the second audio data, the processor 501 is further configured to:
when no voice instruction carried in the second audio data is recognized, launch a voice interaction application in the background;
recognize the voice instruction carried in the second audio data through the voice interaction application, the recognition capability of the voice interaction application being greater than that of the instruction recognition model.
In one embodiment, the processor 501 is further configured to switch the screen of the electronic device to the on state.
It should be noted that the electronic device provided by the embodiments of the present application and the instruction execution method in the foregoing embodiments belong to the same concept; any method provided in the instruction execution method embodiments can run on the electronic device, and its specific implementation is detailed in the instruction execution method embodiments and is not repeated here.
It should be noted that, for the instruction execution method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the flow of implementing the instruction execution method of the embodiments of the present application can be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, such as the memory of the electronic device, and be executed by the processor and the dedicated voice recognition chip within the electronic device; its execution may include the flow of the embodiments of the instruction execution method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
The instruction execution method, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art will, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. An instruction execution method, applied to an electronic device, wherein the electronic device comprises a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor, the instruction execution method comprising:
    while the processor is in a sleep state, controlling the microphone through the dedicated voice recognition chip to capture audio, obtaining first audio data;
    verifying the first audio data through the dedicated voice recognition chip, and if the verification passes, waking the processor;
    verifying the first audio data through the processor, and if the verification passes, controlling the microphone through the processor to capture audio, obtaining second audio data;
    calling, through the processor, a pretrained instruction recognition model to recognize a voice instruction carried in the second audio data, and executing the voice instruction.
  2. The instruction execution method according to claim 1, wherein the instruction recognition model comprises multiple instruction recognition models corresponding to different voice instructions, and calling, through the processor, the pretrained instruction recognition model to recognize the voice instruction carried in the second audio data comprises:
    scoring the second audio data with each instruction recognition model through the processor;
    taking the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data.
  3. The instruction execution method according to claim 2, wherein taking the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data comprises:
    taking the voice instruction corresponding to the instruction recognition model whose score is the highest and reaches a preset score as the voice instruction carried in the second audio data.
  4. The instruction execution method according to claim 1, wherein verifying the first audio data through the dedicated voice recognition chip comprises:
    calling a pretrained scene classification model through the dedicated voice recognition chip to classify the scene of the first audio data, obtaining a scene classification result;
    calling, through the dedicated voice recognition chip, a pretrained first-level text verification model corresponding to the scene classification result to verify whether the first audio data contains a preset wake-up word.
  5. The instruction execution method according to claim 4, wherein verifying the first audio data through the processor comprises:
    calling, through the processor, a pretrained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data contains the preset wake-up word;
    when the first audio data contains the preset wake-up word, calling a pretrained second-level voiceprint verification model through the processor, wherein the second-level voiceprint verification model is trained on sample speech of a preset user speaking the preset wake-up word;
    verifying, with the second-level voiceprint verification model, whether the voiceprint features of the first audio data match the voiceprint features of the sample speech.
  6. The instruction execution method according to claim 5, further comprising:
    obtaining, through the processor, a pretrained universal verification model corresponding to the preset wake-up word, and setting the universal verification model as the second-level text verification model;
    controlling the microphone through the processor to collect sample speech of the preset user speaking the preset wake-up word;
    adaptively training the universal verification model with the sample speech through the processor, obtaining the second-level voiceprint verification model.
  7. The instruction execution method according to claim 1, further comprising, after waking the processor:
    controlling the dedicated voice recognition chip to sleep.
  8. The instruction execution method according to claim 1, further comprising, after calling, through the processor, the pretrained instruction recognition model to recognize the voice instruction carried in the second audio data:
    when no voice instruction carried in the second audio data is recognized, launching a voice interaction application in the background through the processor;
    recognizing the voice instruction carried in the second audio data through the voice interaction application, the recognition capability of the voice interaction application being greater than the recognition capability of the instruction recognition model.
  9. The instruction execution method according to claim 1, further comprising:
    switching the screen of the electronic device to an on state through the processor.
  10. An instruction execution apparatus, applied to an electronic device, wherein the electronic device comprises a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor, the instruction execution apparatus comprising:
    an audio collection module configured to, while the processor is in a sleep state, control the microphone through the dedicated voice recognition chip to capture audio, obtaining first audio data;
    a first verification module configured to verify the first audio data through the dedicated voice recognition chip and, if the verification passes, wake the processor;
    a second verification module configured to verify the first audio data through the processor and, if the verification passes, control the microphone through the processor to capture audio, obtaining second audio data;
    an instruction execution module configured to call, through the processor, a pretrained instruction recognition model to recognize a voice instruction carried in the second audio data, and execute the voice instruction.
  11. A storage medium, wherein, when a computer program stored in the storage medium runs on an electronic device comprising a processor, a dedicated voice recognition chip, and a microphone, the processor and the dedicated voice recognition chip are caused to execute:
    while the processor is in a sleep state, the dedicated voice recognition chip controlling the microphone to capture audio, obtaining first audio data;
    the dedicated voice recognition chip verifying the first audio data and, if the verification passes, waking the processor;
    the processor verifying the first audio data and, if the verification passes, controlling the microphone to capture audio, obtaining second audio data;
    the processor calling a pretrained instruction recognition model to recognize a voice instruction carried in the second audio data, and executing the voice instruction;
    wherein the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor.
  12. An electronic device, wherein the electronic device comprises a processor, a dedicated voice recognition chip, and a microphone, and the power consumption of the dedicated voice recognition chip is less than the power consumption of the processor, wherein:
    the dedicated voice recognition chip is configured to, while the processor is in a sleep state, control the microphone to capture external first audio data; and
    to verify the first audio data and, if the verification passes, wake the processor;
    the processor is configured to verify the first audio data and, if the verification passes, control the microphone to capture external second audio data; and
    to call a pretrained instruction recognition model to recognize a voice instruction carried in the second audio data, and execute the voice instruction.
  13. The electronic device according to claim 12, wherein the instruction recognition model comprises multiple instruction recognition models corresponding to different voice instructions, and the processor is configured to score the second audio data with each instruction recognition model, and to take the voice instruction corresponding to the highest-scoring instruction recognition model as the voice instruction carried in the second audio data.
  14. The electronic device according to claim 13, wherein the processor is configured to take the voice instruction corresponding to the instruction recognition model whose score is the highest and reaches a preset score as the voice instruction carried in the second audio data.
  15. The electronic device according to claim 12, wherein the dedicated voice recognition chip is configured to call a pretrained scene classification model to classify the scene of the first audio data, obtaining a scene classification result, and to call a pretrained first-level text verification model corresponding to the scene classification result to verify whether the first audio data contains a preset wake-up word.
  16. The electronic device according to claim 15, wherein the processor is configured to call a pretrained second-level text verification model corresponding to the preset wake-up word to verify whether the first audio data contains the preset wake-up word; when the first audio data contains the preset wake-up word, to call a pretrained second-level voiceprint verification model, wherein the second-level voiceprint verification model is trained on sample speech of a preset user speaking the preset wake-up word; and to verify, with the second-level voiceprint verification model, whether the voiceprint features of the first audio data match the voiceprint features of the sample speech.
  17. The electronic device according to claim 16, wherein the processor is further configured to obtain a pretrained universal verification model corresponding to the preset wake-up word and set the universal verification model as the second-level text verification model; to control the microphone to collect sample speech of the preset user speaking the preset wake-up word; and to adaptively train the universal verification model with the sample speech, obtaining the second-level voiceprint verification model.
  18. The electronic device according to claim 16, wherein the dedicated voice recognition chip is further configured to sleep after waking the processor.
  19. The electronic device according to claim 16, wherein the processor is further configured to, when no voice instruction carried in the second audio data is recognized, launch a voice interaction application in the background, and to recognize the voice instruction carried in the second audio data through the voice interaction application, the recognition capability of the voice interaction application being greater than the recognition capability of the instruction recognition model.
  20. The electronic device according to claim 12, wherein the processor is further configured to switch the screen of the electronic device to an on state.
PCT/CN2021/073831 2020-02-27 2021-01-26 Instruction execution method and apparatus, storage medium, and electronic device WO2021169711A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21759843.2A EP4095850A1 (en) 2020-02-27 2021-01-26 Instruction execution method and apparatus, storage medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010125950.2A 2020-02-27 Instruction execution method and apparatus, storage medium, and electronic device CN111369992A (zh)
CN202010125950.2 2020-02-27

Publications (1)

Publication Number Publication Date
WO2021169711A1 true WO2021169711A1 (zh) 2021-09-02

Family

ID=71211553

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073831 WO2021169711A1 (zh) 2020-02-27 2021-01-26 指令执行方法、装置、存储介质及电子设备

Country Status (3)

Country Link
EP (1) EP4095850A1 (zh)
CN (1) CN111369992A (zh)
WO (1) WO2021169711A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369992A (zh) 2020-02-27 2020-07-03 Oppo (Chongqing) Intelligent Technology Co., Ltd. Instruction execution method and apparatus, storage medium, and electronic device
CN115711077A (zh) 2022-11-29 2023-02-24 Chongqing Changan Automobile Co., Ltd. Contactless control method and system for a vehicle power door, and vehicle

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105283836A (zh) * 2013-07-11 2016-01-27 Intel Corporation Device wake and speaker verification using the same audio input
CN105744074A (zh) * 2016-03-30 2016-07-06 Qingdao Hisense Mobile Communication Technology Co., Ltd. Method and apparatus for voice operation on a mobile terminal
US20180130485A1 (en) * 2016-11-08 2018-05-10 Samsung Electronics Co., Ltd. Auto voice trigger method and audio analyzer employing the same
CN108958810A (zh) * 2018-02-09 2018-12-07 Beijing Orion Star Technology Co., Ltd. Voiceprint-based user identification method, apparatus, and device
CN109979438A (zh) * 2019-04-04 2019-07-05 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Voice wake-up method and electronic device
CN110021307A (zh) * 2019-04-04 2019-07-16 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Audio verification method and apparatus, storage medium, and electronic device
CN110211599A (zh) * 2019-06-03 2019-09-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Application wake-up method and apparatus, storage medium, and electronic device
CN110223687A (zh) * 2019-06-03 2019-09-10 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Instruction execution method and apparatus, storage medium, and electronic device
CN110473554A (zh) * 2019-08-08 2019-11-19 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Audio verification method and apparatus, storage medium, and electronic device
CN110580897A (zh) * 2019-08-23 2019-12-17 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Audio verification method and apparatus, storage medium, and electronic device
CN110581915A (zh) * 2019-08-30 2019-12-17 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Stability testing method and apparatus, storage medium, and electronic device
CN110602624A (zh) * 2019-08-30 2019-12-20 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Audio testing method and apparatus, storage medium, and electronic device
CN110689887A (zh) * 2019-09-24 2020-01-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Audio verification method and apparatus, storage medium, and electronic device
CN111369992A (zh) * 2020-02-27 2020-07-03 Oppo (Chongqing) Intelligent Technology Co., Ltd. Instruction execution method and apparatus, storage medium, and electronic device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1302456C (zh) * 2005-04-01 2007-02-28 Zheng Fang A voiceprint recognition method
CN101315770B (zh) * 2008-05-27 2012-01-25 Beijing Chengxin Zhuoyue Technology Co., Ltd. Speech recognition system-on-chip and speech recognition method using the same
US9245527B2 (en) * 2013-10-11 2016-01-26 Apple Inc. Speech recognition wake-up of a handheld portable electronic device
CN106847283A (zh) * 2017-02-28 2017-06-13 Guangdong Midea Refrigeration Equipment Co., Ltd. Smart home appliance control method and device
CN108520743B (zh) * 2018-02-02 2021-01-22 Baidu Online Network Technology (Beijing) Co., Ltd. Voice control method for a smart device, smart device, and computer-readable medium
CN109243461B (zh) * 2018-09-21 2020-04-14 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method, apparatus, device, and storage medium
CN109688036B (zh) * 2019-02-20 2021-12-03 Guangzhou Shiyuan Electronics Technology Co., Ltd. Control method and apparatus for a smart home appliance, smart home appliance, and storage medium
CN110265040B (zh) * 2019-06-20 2022-05-17 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Voiceprint model training method and apparatus, storage medium, and electronic device
CN110544468B (zh) * 2019-08-23 2022-07-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Application wake-up method and apparatus, storage medium, and electronic device


Also Published As

Publication number Publication date
EP4095850A1 (en) 2022-11-30
CN111369992A (zh) 2020-07-03

Similar Documents

Publication Publication Date Title
US10515640B2 (en) Generating dialogue based on verification scores
US10719115B2 (en) Isolated word training and detection using generated phoneme concatenation models of audio inputs
US20170256270A1 (en) Voice Recognition Accuracy in High Noise Conditions
US20160266910A1 (en) Methods And Apparatus For Unsupervised Wakeup With Time-Correlated Acoustic Events
WO2017012511A1 (zh) Voice control method and device, and projector apparatus
US9466286B1 (en) Transitioning an electronic device between device states
CN105448294A (zh) Intelligent speech recognition system for in-vehicle devices
CN109272991B (zh) Voice interaction method, apparatus, device, and computer-readable storage medium
CN110211599B (zh) Application wake-up method and apparatus, storage medium, and electronic device
CN110602624B (zh) Audio testing method and apparatus, storage medium, and electronic device
CN110223687B (zh) Instruction execution method and apparatus, storage medium, and electronic device
WO2020244257A1 (zh) Voice wake-up method and system, electronic device, and computer-readable storage medium
WO2021169711A1 (zh) Instruction execution method and apparatus, storage medium, and electronic device
CN110581915B (zh) Stability testing method and apparatus, storage medium, and electronic device
CN111261195A (zh) Audio testing method and apparatus, storage medium, and electronic device
CN110544468B (zh) Application wake-up method and apparatus, storage medium, and electronic device
CN110706707B (zh) Method, apparatus, device, and computer-readable storage medium for voice interaction
CN109712623A (zh) Voice control method and apparatus, and computer-readable storage medium
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN118020100A (zh) Voice data processing method and apparatus
US10424292B1 (en) System for recognizing and responding to environmental noises
US20180366127A1 (en) Speaker recognition based on discriminant analysis
WO2019041871A1 (zh) Voice object recognition method and apparatus
WO2020102991A1 (zh) Method and apparatus for waking a device, storage medium, and electronic device
KR20120111510A (ko) Robot control system via interactive speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21759843

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021759843

Country of ref document: EP

Effective date: 20220823

NENP Non-entry into the national phase

Ref country code: DE