WO2019041871A1 - Voice object recognition method and apparatus - Google Patents

Voice object recognition method and apparatus

Info

Publication number
WO2019041871A1
WO2019041871A1 PCT/CN2018/085335 CN2018085335W
Authority
WO
WIPO (PCT)
Prior art keywords
voice
wake
speech
object recognition
recognition model
Prior art date
Application number
PCT/CN2018/085335
Other languages
English (en)
French (fr)
Inventor
孙凤宇
肖建良
樊伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019041871A1 publication Critical patent/WO2019041871A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies

Definitions

  • the present application relates to the field of voice recognition technology, and in particular, to a voice object recognition method and apparatus.
  • Speech object recognition, or voiceprint recognition, is a recognition technique based on the human voice. Because people's vocal organs differ when speaking, the voiceprints of any two speakers are different, so a voiceprint can serve as a biometric that characterizes individual differences. Different individuals can therefore be distinguished by establishing a recognition model and then using that model to identify them. Speech object recognition technology has been widely adopted because it is low-cost, accurate, and convenient.
  • For example, an intelligent terminal can use voice object recognition technology so that, after a user performs voice registration, other users cannot wake up the terminal, which protects the user's privacy.
  • However, a speech object recognition model established from registered speech models the speaker only from the speaker's limited corpus. Once registration succeeds, the model is fixed, and the recognition effect depends heavily on the voice the user registered. A speaker's speech is related both to subjective factors such as speaking rate and emotion, and to objective factors such as physical health; over time, these factors change the speaker's pronunciation. Because the corpus used to register the speech object recognition model is limited and cannot fully model the speaker's voice, the recognition rate of such a system is difficult to improve. The longer the corpus used for voiceprint training, the more accurate the established feature model and the higher the recognition accuracy, but requiring a long training corpus makes this method of model establishment impractical.
  • the present application provides a voice object recognition method and apparatus to improve the accuracy of voice object recognition.
  • A voice object recognition method is provided, comprising: receiving a voice of a voice object and acquiring a voice signal from the voice, where the voice signal includes a wake-up voice signal and a service instruction voice signal; matching the wake-up voice signal with a voice object recognition model; if the matching succeeds, executing the service instruction; and, when the service instruction executes successfully, determining, according to a confidence factor corresponding to the wake-up voice signal and a score factor corresponding to the service instruction, that the wake-up voice signal is used as an additional training sample, and training the voice object recognition model with that training sample.
  • In this way, the wake-up speech signal is filtered by combining voice matching with service instruction execution, thereby improving the accuracy of voice object recognition.
  • In a possible implementation, the method further includes: determining whether the weighted sum of the confidence factor of the wake-up voice signal and the score factor of at least one service instruction is greater than or equal to a set threshold; if so, determining that the wake-up voice signal is used as an additional training sample.
  • the scoring factor of the service instruction is related to at least one of the following parameters: privacy of the service, and historical application frequency of the service.
  • The better the privacy of the service, the higher the scoring factor of the service instruction; and the higher the historical application frequency of the service when the terminal is awakened, the higher the scoring factor of the service instruction.
  • In a possible implementation, before the valid wake-up speech signal is used as an additional training sample, the method further includes: processing the valid wake-up speech signal, where the processing includes at least one of the following operations: noise reduction and removal of silent segments.
  • the wake-up speech signal is processed before the wake-up speech signal is used as the additional training sample, so that the accuracy of the speech object recognition model update can be improved.
  • In a possible implementation, before training the voice object recognition model with the training sample, the method further includes: establishing the voice object recognition model according to a preset training sample.
  • In this way, a speech object recognition model is established for subsequent speech object recognition. The updated recognition model can also be used as the recognition model, and the above steps repeated, so that the recognition model is continuously corrected and updated and its accuracy continuously improved.
  • In a possible implementation, training the voice object recognition model with the training sample comprises: generating a modified voice object recognition model according to the valid wake-up voice signal and the preset training sample; and updating the voice object recognition model with the modified voice object recognition model.
  • a voice object recognition device having a function of realizing the behavior of a voice object recognition device in the above method.
  • the functions may be implemented by hardware or by corresponding software implemented by hardware.
  • the hardware or software includes one or more modules corresponding to the functions described above.
  • The voice object recognition apparatus includes: a voice acquiring unit, configured to receive a voice of a voice object and acquire a voice signal from the voice, where the voice signal includes a wake-up voice signal and a service instruction voice signal; a wake-up unit, configured to match the wake-up voice signal acquired by the acquiring unit with a voice object recognition model; an execution unit, configured to execute the service instruction if the matching succeeds; and a training unit, configured to determine, when the execution unit executes the service instruction successfully, according to the confidence factor corresponding to the wake-up voice signal and the score factor corresponding to the service instruction, that the wake-up voice signal is used as an additional training sample, and to train the speech object recognition model with the training sample.
  • The voice object recognition apparatus includes: a receiver, a transmitter, a memory, and a processor, where the memory stores a set of program code and the processor is configured to invoke the program code stored in the memory to perform the following operations: receiving a voice of a voice object and acquiring a voice signal from the voice, where the voice signal includes a wake-up voice signal and a service instruction voice signal; matching the wake-up voice signal with the voice object recognition model; if the matching succeeds, executing the service instruction; and, when the service instruction executes successfully, determining, according to the confidence factor corresponding to the wake-up voice signal and the score factor corresponding to the service instruction, that the wake-up voice signal is used as an additional training sample, and training the speech object recognition model with the training sample.
  • The processor is further configured to: determine whether the weighted sum of the confidence factor of the wake-up voice signal and the score factor of at least one service instruction is greater than or equal to a set threshold; if so, determine that the wake-up voice signal is used as an additional training sample.
  • the score factor of the service instruction is related to at least one of the following parameters: the privacy of the service, and the historical application frequency of the service.
  • Before the processor performs the operation of using the valid wake-up speech signal as an additional training sample, the processor further performs the following operation: processing the valid wake-up speech signal, where the processing includes at least one of the following operations: noise reduction and removal of silent segments.
  • Before the processor performs the operation of training the voice object recognition model with the training sample, the processor further performs the following operation: establishing the voice object recognition model according to a preset training sample.
  • The processor performing the operation of training the voice object recognition model with the training sample comprises: generating a modified voice object recognition model according to the valid wake-up voice signal and the preset training sample, and updating the voice object recognition model with the modified voice object recognition model.
  • For the method steps and beneficial effects of the above possible voice object recognition devices, reference may be made to the implementation of the method; details are not repeated here.
  • Yet another aspect of the present application provides a computer readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the methods described in the above aspects.
  • Yet another aspect of the present application provides a communication chip in which instructions are stored that, when run on a network device or terminal device, cause the network device or terminal device to perform the methods described in the various aspects above.
  • Yet another aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the methods described in the various aspects above.
  • FIG. 1 is a schematic flowchart of a voice object recognition method according to an embodiment of the present invention;
  • FIG. 2 is a schematic flowchart further refining the voice object recognition method shown in FIG. 1;
  • FIG. 3 is a schematic diagram of a voice signal noise reduction process;
  • FIG. 4 is a schematic diagram of a voice activation detection process;
  • FIG. 5 is a schematic structural diagram of a voice object recognition apparatus according to an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a voice object recognition device according to an embodiment of the present invention.
  • An example application scenario of the present application is that a user wants to make a call to a friend by using a mobile phone.
  • The mobile phone is locked and needs to be unlocked; currently, voice can be used to wake up the mobile phone for unlocking. Because the corpus used for voice unlocking is generally short, and the user's voice changes over time (speaking rate, mood, physical health, and so on can all affect the user's pronunciation), a voice object that has illegally acquired the user's mobile phone may wake it up by mistake, while the user may fail to wake it up because, after a period of time, the user's pronunciation no longer matches the registered voice.
  • The embodiment of the invention provides a voice object recognition method, apparatus, and device that combine voice matching with service instruction execution to filter wake-up speech signals, and use the filtered voice signal to update the voice object recognition model, thereby improving the accuracy of voice object recognition.
  • FIG. 1 is a schematic flowchart of a method for identifying a voice object according to an embodiment of the present invention, where the method may include the following steps:
  • S101: Receive a voice of a voice object and acquire a voice signal from the voice, where the voice signal includes a wake-up voice signal and a service instruction voice signal.
  • the voice may be an audio stream generated by a voice object through a smart terminal for voice chat or a voice command, or an audio stream obtained by recording or the like.
  • it may be an audio stream that is input by a voice object through a voice input device such as a microphone or detected by a voice sensor.
  • After the voice input device receives the voice of the voice object, the voice signal in the voice is acquired. The voice itself may be an analog voice signal; the voice signal obtained in this embodiment is a digital voice signal, or an electrical signal, produced by analog-to-digital conversion.
  • The voice object here may be the legal holder of the smart terminal, or any user who holds the smart terminal, such as an illegal holder of the smart terminal or a family member of the legal holder; the voice object may also be a machine, and so on.
  • The voice signal includes two voice signals, that is, a wake-up voice signal and a service instruction voice signal.
  • In this embodiment, the voice object may instruct the terminal to perform a service by using a voice instruction.
  • the voice signal of the service instruction may be one or multiple, that is, the voice object may simultaneously issue multiple service instructions.
  • The wake-up speech signal should be a speech signal consistent with the corpus of the registered speech in the speech object recognition model, and the service instruction speech signal is a speech signal instructing the terminal to execute a service instruction.
  • the user can say "Hello, Xiaoyi, please call a friend's phone" to the microphone of the mobile phone.
  • Whether the voice object recognition model is updated depends on whether a voiceprint learning function switch is activated, and the user can set this switch as needed.
  • the method may further include: establishing a voice object recognition model according to the preset training samples.
  • the speech object recognition model is a recognition model pre-established according to a training sample of a preset speech signal stream, that is, a training sample associated with a preset speech signal stream is provided in advance, and a speech object recognition model is formed according to the training sample training.
  • the speech object recognition model is a feature model formed after a voiceprint registration process performed for an object.
  • The method provided by the embodiment of the present invention can implement the operation of updating or correcting the model. Therefore, the voice object recognition model can be a recognition model obtained by an existing method, or a revised recognition model obtained by the method provided by the embodiment of the present invention.
  • S102: Match the wake-up speech signal with the voice object recognition model and determine whether the matching succeeds. If the matching succeeds, proceed to step S103; otherwise, jump to step S105, end the flow, and terminate training of the speech object recognition model.
  • the speech object recognition model is established in advance, it is possible to determine whether the matching is successful according to the matching degree of the wake-up speech signal of the speech object and the speech object recognition model.
  • the voiceprint confirmation algorithm interface is invoked to obtain a matching degree between the wake-up voice signal and the voice object recognition model.
  • the matching degree may be calculated by using the wake-up speech signal as an input value of the speech object recognition model, and acquiring a matching degree corresponding to the wake-up speech signal and the speech object recognition model, or a corresponding probability.
  • the degree of matching or probability represents the magnitude of the correlation of the wake-up speech signal with the speech object recognition model. If the calculated matching degree is greater than or equal to the preset matching degree threshold, the wake-up speech signal is considered to be successfully matched with the voice signal recognition model; otherwise, the matching fails.
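A minimal sketch of the threshold decision described above, assuming a scoring function that returns a matching degree (or probability) in [0, 1]; the scoring function, model object, and 0.8 threshold are hypothetical placeholders, not values from the patent:

```python
def matches_model(score_fn, wake_signal, model, threshold=0.8):
    """Return True if the wake-up signal matches the recognition model.

    score_fn(signal, model) is assumed to return the matching degree
    between the wake-up speech signal and the speech object recognition
    model; matching succeeds when it reaches the preset threshold.
    """
    degree = score_fn(wake_signal, model)
    return degree >= threshold

# Usage with a dummy scorer standing in for the voiceprint algorithm:
dummy_score = lambda sig, mdl: 0.91
assert matches_model(dummy_score, b"wake", None) is True
```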
  • If the matching succeeds, the terminal is woken up.
  • S103: Execute the service instruction and determine whether the execution succeeds. If the execution succeeds, proceed to step S104; otherwise, jump to step S105, end the flow, and terminate training of the speech object recognition model.
  • the voice signal may include one or more service instructions, and the terminal may separately execute the one or more service instructions.
  • the service may be some pre-designated service related to the subsequent training of the speech object recognition model.
  • Whether the service instruction is successfully executed means whether the indicated service is completed. For example, if the service instruction voice signal is "calling a friend's phone", successful execution means that the phone number of the friend *** is found in the address book and dialed; similarly, if the service instruction voice signal is "play music", success means opening the music player and playing music in it according to the default settings.
  • the determination that the service command is successfully executed refers to determining whether each service command is successfully executed.
  • determining whether the service instruction is successfully executed refers to determining whether the service instruction related to determining whether the wake-up voice signal can be used as the additional training sample is successfully executed.
  • The voice signal may also include service instructions that are unrelated to determining whether the wake-up voice signal can be used as an additional training sample. In this embodiment, if any one of the relevant service instructions fails to execute, the wake-up speech signal may be discarded, and the training or update of the voice object recognition model terminated.
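A minimal sketch of executing one or more service instructions and recording per-instruction success, so the caller can discard the wake-up signal if any relevant instruction fails; the handler mapping and instruction names are hypothetical, since the patent does not specify a dispatch mechanism:

```python
def execute_instructions(instructions, handlers):
    """Execute each service instruction; return per-instruction success.

    handlers is a hypothetical mapping from instruction name to a
    zero-argument callable returning True on success. An instruction
    with no registered handler is treated as failed.
    """
    results = {}
    for name in instructions:
        handler = handlers.get(name)
        results[name] = bool(handler and handler())
    return results

# Hypothetical handlers: the call succeeds, music playback fails.
handlers = {"call": lambda: True, "play_music": lambda: False}
res = execute_instructions(["call", "play_music"], handlers)
assert res == {"call": True, "play_music": False}
```

If `play_music` fails here, this embodiment would discard the wake-up speech signal rather than use it for training.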
  • S104: When the service instruction is executed successfully, if it is determined, according to the confidence factor corresponding to the wake-up voice signal and the score factor corresponding to the service instruction, that the wake-up voice signal is used as an additional training sample, the speech object recognition model is trained by using the training sample.
  • The method may further include: determining whether the weighted sum of the confidence factor of the wake-up voice signal and the score factor of at least one service instruction is greater than or equal to a set threshold.
  • The wake-up voice signal carries a certain degree of certainty in determining whether it can be used as an additional training sample, expressed by a confidence or a confidence factor.
  • The confidence depends on the speech object recognition score obtained when the wake-up speech signal is matched. In theory, the higher the score, the greater the probability that the wake-up speech comes from the voice object of the registered speech (for example, the legal holder of the terminal), and the larger the corresponding confidence or confidence factor. In the prior art, if the terminal can be awakened, the voice object is simply assumed to be the legal holder of the terminal.
  • In this embodiment, determining whether the wake-up voice signal can be used as an additional training sample also needs to consider the score of the service instruction.
  • the confidence factor and the scoring factor of the service instruction may be a number greater than or equal to 0 and less than or equal to 1. In this embodiment, the confidence factor should be less than 1, because the scoring factor for executing the business instruction also needs to be considered.
  • After the voice object wakes up the terminal, it can also instruct the terminal to perform certain services. If the voice object is the legal holder of the terminal, it is familiar with the services installed on, and the content stored in, the terminal it holds; therefore, if the indicated service can be executed successfully, a certain scoring factor can be granted when judging the validity of the wake-up speech signal.
  • the scoring factor of the business instruction is related to at least one of the following parameters: the privacy of the business, and the historical application frequency of the business. That is, the better the privacy of the service, the higher the score factor; conversely, the worse the privacy of the service, the lower the score factor. Similarly, the higher the historical application frequency of the service, the higher the score factor; conversely, the lower the historical application frequency of the service, the lower the score factor. If a business instruction fails to execute, it can be considered that the score factor of the service instruction is 0, or the service instruction is not considered when the validity judgment is made.
  • the present embodiment by comprehensively considering the confidence factor of the wake-up speech signal and the scoring factor of the service instruction, it is possible to improve the accuracy of determining whether the wake-up speech signal can be used as an additional training sample.
  • training the voice object recognition model by using the training sample may further include the following steps:
  • the speech object recognition model is updated using the modified speech object recognition model.
  • the voiceprint registration algorithm interface is invoked according to the wake-up speech signal and the preset training sample to generate a modified speech object recognition model.
  • the preset training sample is also a training sample used to generate the above-mentioned voice object recognition model.
  • the modified speech object recognition model is a more accurate recognition model, and the speech object recognition model is updated by using the modified speech object recognition model (for example, the modified speech object recognition model is saved as a speech object recognition model to replace the previous The speech object recognition model) can achieve the purpose of model adaptation and intelligence.
  • the wake-up speech signal is used as an additional training sample, that is, the voiceprint registration algorithm interface is invoked according to the wake-up speech signal and the preset training sample, and the modified recognition model is generated.
  • the speech object recognition model training algorithm used in this embodiment may be an incremental training method based on GMM-UBM.
  • Other training methods such as i-Vector, d-Vector, etc. can implement speech object recognition model training.
  • the updated recognition model can be used as the recognition model, and the above steps are repeated to continuously correct and update the recognition model, and the accuracy of the recognition model is continuously improved.
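The GMM-UBM style incremental training mentioned above can be illustrated in a reduced form, adapting a single one-dimensional Gaussian mean toward the statistics of the new wake-up samples; the relevance factor of 16.0 is a conventional choice in MAP adaptation, not a value specified by the patent:

```python
def map_adapt_mean(old_mean, new_samples, relevance=16.0):
    """One incremental update step, reduced to a single 1-D Gaussian mean.

    The adapted mean interpolates between the mean of the new wake-up
    samples and the previous model mean; the relevance factor controls
    how strongly the old model resists new data.
    """
    n = len(new_samples)
    sample_mean = sum(new_samples) / n
    alpha = n / (n + relevance)  # data weight grows with sample count
    return alpha * sample_mean + (1.0 - alpha) * old_mean

# Four new samples centred at 2.0 pull the mean slightly away from 0.0:
adapted = map_adapt_mean(0.0, [2.0, 2.0, 2.0, 2.0], relevance=16.0)
assert abs(adapted - 0.4) < 1e-9  # alpha = 4/20 = 0.2, so 0.2 * 2.0 = 0.4
```

Repeating this step with each accepted wake-up signal realizes the continuous correction and update of the recognition model described above.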
  • a voice object recognition model updating method which combines voice matching and service instruction execution to filter wake-up speech, thereby improving the accuracy of the voice object recognition model update.
  • FIG. 2 is a schematic flow chart for further refining a method for updating a voice object recognition model shown in FIG. 1. Taking the wake-up and training speech object recognition model as an example, the method may include the following steps:
  • S201: The user inputs voice through the microphone of the mobile phone, and the voice signal in the voice is obtained, where the voice signal includes a wake-up voice signal and a service instruction voice signal.
  • S202: Match the wake-up voice signal with the voice object recognition model. If the matching succeeds, the mobile phone is awakened and the process goes to step S203; otherwise, the process proceeds to step S207, and training of the speech object recognition model is terminated.
  • The matching in step S202 is the same as that in step S102 of the embodiment shown in FIG. 1, and details are not described here again.
  • Step S203 (executing the service instruction) is the same as step S103 of the embodiment shown in FIG. 1, and details are not described here again.
  • S204: Determine whether the weighted sum of the confidence factor of the wake-up voice signal and the score factor of the service instruction is greater than or equal to a set threshold. If yes, proceed to step S205; otherwise, proceed to step S207 and terminate training of the voice object recognition model.
  • Specifically, the confidence factor of the wake-up voice signal and the score factor of at least one service instruction are each weighted, and it is determined whether the weighted sum of the two is greater than or equal to the set threshold. If it is, the wake-up speech signal can be used as an additional training sample of the speech object recognition model, that is, the speech object recognition model is trained; otherwise, the training of the speech object recognition model is terminated and the wake-up speech signal is discarded.
  • For example, the following formula can be used to determine whether the wake-up speech signal can be used as an additional training sample:
  • w_s · δ + Σ_k (w_k · δ_k) ≥ Thd_1, where δ is the confidence (confidence factor) of the wake-up speech signal, w_s is the confidence weight of the wake-up speech, δ_k is the scoring factor of the k-th voice interactive service, w_k is the weight of the k-th voice interactive service, and Thd_1 is the decision threshold.
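The weighted decision over the confidence factor, per-service score factors, weights, and threshold can be sketched as follows; all constant values used here are illustrative, since the patent only states that the constants can be adjusted in practice:

```python
def is_additional_sample(confidence, service_scores, w_s=0.6,
                         service_weights=None, thd1=0.7):
    """Weighted-sum decision: w_s * conf + sum(w_k * score_k) >= Thd_1.

    service_scores holds the score factor of each executed service
    instruction (0 for an instruction that failed to execute). The
    default weights and threshold are illustrative constants.
    """
    if service_weights is None:
        service_weights = [0.2] * len(service_scores)
    total = w_s * confidence + sum(
        w * s for w, s in zip(service_weights, service_scores))
    return total >= thd1

# High-confidence wake-up plus one successful, private service passes:
assert is_additional_sample(0.9, [0.9]) is True
# Low confidence and a low-privacy service does not:
assert is_additional_sample(0.5, [0.2]) is False
```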
  • The confidence δ of the wake-up speech signal depends on the speech object recognition score obtained when the wake-up speech signal is matched. In theory, the higher the score, the greater the probability that the wake-up speech signal comes from the registered speech object, and the larger the corresponding confidence value δ.
  • For the voice interactive service score factor δ_k, the service refers to fixed services of the mobile phone system, such as making a call, sending a text message, or sending an E-mail, and to APP services, such as real-time communication, third-party electronic payment, and mobile games.
  • Table 1 is an illustration of the scoring factors for these services. The higher the score, the greater the probability that the wake-up speech signal is the speech object of the registered speech.
  • w_s, w_k, and Thd_1 are constants; these constants can be adjusted in practical applications to optimize system performance.
  • the scoring factor of the service instruction is related to at least one of the following parameters: the privacy of the service, and the historical application frequency of the service.
  • Regarding the relationship between the privacy of the service and its scoring factor: for the calling service in Table 1, the address book must be searched; if the voice object is the legal holder of the terminal, the number of the called party is generally stored in the terminal's address book, so if the call can be dialed successfully, the score factor can be set relatively high. For the music-playing service in Table 1, since a player APP is installed on most terminals, when the voice object instructs music to be played, the score factor of that service can be set relatively low.
  • For example, if the legal holder of the terminal uses WeChat most frequently, the score factor corresponding to WeChat can be set highest; if the voice object opens WeChat immediately after waking up the terminal, the probability that the voice object is the voice object of the registered voice is higher, and so is the probability that the wake-up voice signal is valid.
  • the setting of the score factor of the service may be set by the user, or the system may be set according to the statistical data.
  • The scoring factor of the service instruction may be related to the privacy level X_k of the service, in which case the scoring factor is δ_k(X_k); or it may be related to the historical application frequency Y_k of the service, in which case the scoring factor is δ_k(Y_k); if the scoring factor is related to both X_k and Y_k, it is δ_k(X_k, Y_k).
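A sketch of a score factor δ_k as a function of the privacy level X_k and historical application frequency Y_k; the normalization ranges and the simple averaging rule are assumptions, since the patent only states that δ_k is related to these parameters:

```python
def score_factor(privacy_level, history_freq, max_privacy=5, max_freq=100):
    """Hypothetical δ_k(X_k, Y_k) in [0, 1].

    Both a higher privacy level and a higher historical application
    frequency raise the score factor. max_privacy and max_freq are
    illustrative normalization bounds.
    """
    x = min(privacy_level, max_privacy) / max_privacy
    y = min(history_freq, max_freq) / max_freq
    return 0.5 * (x + y)

assert score_factor(5, 100) == 1.0  # most private, most frequently used
assert score_factor(0, 0) == 0.0    # no privacy, never used
```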
  • In addition, the ideal corpus for training the speech object recognition model is one recorded in a quiet environment, so the SNR can be used to judge the corpus environment: when the SNR is greater than or equal to a threshold Thd2, the environment in which the voice object wakes up the terminal is considered quiet, and the wake-up speech signal can be used as an additional training sample; otherwise, the background environment is considered noisy.
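The SNR gate can be sketched as follows; the 15 dB default for Thd2 and the use of mean-squared-amplitude frame energies are assumptions, as the patent gives no concrete values:

```python
import math

def is_quiet_environment(signal_energy, noise_energy, thd2_db=15.0):
    """Treat the wake-up environment as quiet when SNR (dB) >= Thd2.

    signal_energy and noise_energy are assumed to be mean squared
    amplitudes of speech frames and noise-only frames respectively.
    """
    snr_db = 10.0 * math.log10(signal_energy / noise_energy)
    return snr_db >= thd2_db

assert is_quiet_environment(1000.0, 1.0) is True  # 30 dB: quiet enough
assert is_quiet_environment(2.0, 1.0) is False    # about 3 dB: noisy
```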
  • S205: Process the valid wake-up speech signal, where the processing includes at least one of the following operations: noise reduction and removal of silent segments.
  • Processing the valid wake-up speech signal can prepare for the subsequent speech model training.
  • the processing includes: noise reduction processing, voice activation detection, and the like.
  • the noise reduction processing suppresses the noise component of the speech signal and improves speech quality;
  • voice activation detection removes the silent segments of the speech signal and retains the effective speech, ensuring that the input to model training contains only speech.
  • FIG. 3 presents a flow chart of speech-signal noise reduction processing, for which a single-microphone noise reduction method can be adopted.
  • The method mainly calculates the probability that speech is present in each frequency band of the microphone signal and, according to that presence probability, applies a different gain to each frequency point to achieve noise reduction.
  • Noise reduction can also be implemented with multi-microphone array algorithms, adaptive filtering, signal-subspace methods, neural networks, and other noise-reduction algorithms; all of these can achieve the purpose of speech noise reduction.
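A minimal sketch of the single-microphone idea (per-frequency speech-presence weighting turned into a gain) might look like the following; the Wiener-style gain, the spectral floor, and the way `noise_psd` is obtained are simplifying assumptions rather than the patent's exact algorithm:

```python
import numpy as np

def denoise_frame(frame, noise_psd, floor=0.1):
    """Attenuate noise in one frame via a per-frequency gain.

    Bins whose power is well above the noise estimate (speech likely
    present) receive a gain near 1; noise-dominated bins are pushed
    down to a spectral floor. In a real system, noise_psd would be
    tracked from frames classified as noise-only."""
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    # SNR-driven Wiener-style gain per frequency bin
    snr = np.maximum(power / (noise_psd + 1e-12) - 1.0, 0.0)
    gain = np.maximum(snr / (snr + 1.0), floor)
    return np.fft.irfft(gain * spec, n=len(frame))
```

Applied to a frame whose spectrum matches the noise estimate, the gain sits at the floor everywhere, so the output energy drops well below the input energy.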
  • FIG. 4 provides a speech-signal VAD (voice activity detection) method.
  • The method processes the signal frame by frame, judging from the average amplitude of a frame of the sound signal whether that frame belongs to a speech segment: if the sum of the amplitudes of the frame signal is greater than the minimum amplitude for the frame, the frame of speech is retained; otherwise, it is discarded.
  • Frame_Len is the frame length
  • Eng is the sum of the amplitudes of the samples in the frame signal
  • MIN_Amp is the minimum amplitude of the speech signal sampling point.
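The frame-wise VAD rule above can be sketched as follows; interpreting the retention test as Eng > MIN_Amp × Frame_Len is an assumption, since the exact comparison is given only in FIG. 4:

```python
def vad_keep_frames(samples, frame_len, min_amp):
    """Frame-wise VAD sketch: keep a frame when the sum of its sample
    magnitudes (Eng) exceeds the minimum-amplitude threshold scaled to
    a full frame (MIN_Amp * Frame_Len); otherwise discard the frame."""
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        eng = sum(abs(s) for s in frame)  # Eng: amplitude sum of the frame
        if eng > min_amp * frame_len:
            kept.extend(frame)            # speech segment: retain
    return kept

# Four silent samples followed by four loud ones, frame length 4:
print(vad_keep_frames([0.0] * 4 + [1.0] * 4, 4, 0.1))  # [1.0, 1.0, 1.0, 1.0]
```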
  • S206: Perform incremental model training to obtain a new speech object recognition model. This step is the same as step S105 of the embodiment shown in FIG. 1, and details are not described herein again.
  • According to the speech object recognition model updating method provided by this embodiment of the present invention, wake-up speech signals are filtered by combining speech matching with service-instruction execution, which improves the accuracy of voice object recognition; and the wake-up speech signal is processed before being used as an additional training sample, which improves the accuracy of the speech object recognition model.
  • FIG. 5 is a schematic structural diagram of a voice object recognition model updating apparatus according to an embodiment of the present invention.
  • The apparatus may include: a voice acquiring unit (not shown), a waking unit 12, an executing unit (not shown), and a model training unit 11; further, a training sample appending unit 13 and a wake-up speech processing unit 14 may also be included.
  • the device may be a terminal device or a processing component in the terminal device, such as a graphics processing unit (GPU), an image processing unit (IPU), or the like.
  • If the apparatus is a processing component in the terminal device, the apparatus receives the already-matched wake-up speech signal; the acquisition of the speech signal, the matching of the wake-up speech signal, and the execution of the service instruction are all completed by other components of the terminal device, and the apparatus only performs training or updating of the speech object recognition model.
  • a voice acquiring unit, configured to receive the voice of a voice object and acquire the speech signal in the voice, where the speech signal includes a wake-up speech signal and the speech signal of a service instruction;
  • the waking unit 12 is configured to match the wake-up speech signal acquired by the acquiring unit with a voice object recognition model
  • an execution unit, configured to execute at least one service instruction if the matching succeeds;
  • the model training unit 11 is configured to: when the execution unit executes the service instruction successfully, if it is determined, according to the confidence factor corresponding to the wake-up speech signal and the score factor corresponding to the service instruction, that the wake-up speech signal is to be used as an additional training sample, train the speech object recognition model using the training sample.
  • the training sample appending unit 13 is configured to judge whether the weighted sum of the confidence factor of the wake-up speech signal and the score factor of the at least one service instruction is greater than or equal to a set threshold; if the weighted sum is greater than or equal to the set threshold, it determines that the wake-up speech signal is to be used as an additional training sample
  • the score factor of the service instruction is related to at least one of the following parameters: the privacy of the service, and the historical application frequency of the service.
  • the wake-up speech processing unit 14 is configured to process the valid wake-up speech signal, where the processing includes at least one of the following operations: noise reduction and silent-segment removal.
  • model training unit 11 is further configured to establish the voice object recognition model according to a preset training sample.
  • the model training unit 11 is specifically configured to: generate a corrected speech object recognition model according to the valid wake-up speech signal and the preset training samples; and update the speech object recognition model with the corrected speech object recognition model.
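The generate-a-corrected-model-then-replace flow can be sketched as below. A single diagonal Gaussian stands in for the GMM-UBM incremental training mentioned elsewhere in the patent, and all class and method names are hypothetical:

```python
import numpy as np

class SpeakerModel:
    """Sketch of the update flow: keep the registration corpus, append each
    validated wake-up utterance's feature vectors, refit, and let the
    corrected model replace the previous one."""

    def __init__(self, base_features):
        self.samples = np.asarray(base_features, dtype=float)
        self._refit()

    def _refit(self):
        self.mean = self.samples.mean(axis=0)
        self.var = self.samples.var(axis=0) + 1e-6  # avoid zero variance

    def append_and_update(self, wakeup_features):
        # the validated wake-up signal becomes an additional training sample
        self.samples = np.vstack([self.samples, wakeup_features])
        self._refit()  # the corrected model replaces the old one

    def log_likelihood(self, feats):
        """Average log-likelihood, usable as a matching degree."""
        feats = np.asarray(feats, dtype=float)
        z = (feats - self.mean) ** 2 / self.var
        return float(np.mean(-0.5 * (z + np.log(2 * np.pi * self.var)).sum(axis=1)))
```

Repeating `append_and_update` for each accepted wake-up utterance mirrors the patent's idea of continually refining the model as validated corpora accumulate.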
  • According to the speech object recognition model updating apparatus provided by this embodiment of the present invention, wake-up speech signals are filtered by combining speech matching with service-instruction execution, thereby improving the accuracy of voice object recognition.
  • FIG. 6 is a schematic structural diagram of a voice object recognition model updating device according to an embodiment of the present invention.
  • The device 200 may include: a processor 21, a memory 22 (one or more computer-readable storage media), a communication module 23 (optional), and an input/output system 24. These components can communicate over one or more communication buses 25.
  • the input/output system 24 is mainly used to implement an interactive function between the voice object recognition device 200 and the user/external environment, and mainly includes input and output devices of the voice object recognition device 200.
  • The input/output system 24 can include a touch screen controller 241, an audio controller 242, and a sensor controller 243, each of which can be coupled with its corresponding peripheral device (the touch screen 244, the audio circuit 245, and the sensor 246).
  • The audio circuit 245, which may for example be a microphone, can receive the voice of the user or the external environment.
  • The sensor 246 can likewise collect the voice of the user or the external environment.
  • The audio controller 242 and the sensor controller 243 respectively acquire the speech signals in the received or collected voice. It should be noted that the input/output system 24 may also include other I/O peripherals.
  • The processing module 21 includes one or more central processing units (CPUs) 211, one or more image processors or graphics processors 212, and one or more digital signal processors (DSPs) 213.
  • Each processor can integrate one or more processing modules, a clock module, and a power management module.
  • the clock module is primarily used to generate clocks required for data transfer and timing control for the processor.
  • the power management module is mainly used to provide a stable, high-accuracy voltage for the processing module 21, the communication module 23, the input and output system 24, and the like.
  • Specifically, the DSP 213 is configured to match the acquired wake-up speech signal against the voice object recognition model, for example performing step S102 or S202 of the foregoing embodiments;
  • the IPU or GPU 212 is configured to determine the wake-up speech signal as an additional training sample and to train the voice object recognition model with that sample, for example performing steps S104, or S204 and S206, of the foregoing embodiments, and to process the valid wake-up speech signal, for example performing step S205;
  • the CPU 211 is configured to coordinate the work of the memory 22, the IPU or GPU 212, the DSP 213, the communication module 23, and the input/output system 24, for example to execute a service instruction as in step S103 or S203 of the foregoing embodiments.
  • In an alternative implementation, the processing module 21 may include only one or more CPUs, in which case all operations of the above processing module are performed by the one or more CPUs.
  • the communication module 23 is for receiving and transmitting wireless signals, and mainly integrates the receiver and the transmitter of the voice object recognition device 200.
  • the communication module 23 can include, but is not limited to, a Wi-Fi module and a Bluetooth module.
  • the Wi-Fi module and the Bluetooth module can be used to establish Wi-Fi, Bluetooth and other communication connections with other communication devices, respectively, to achieve close-range data communication.
  • the communication module 23 can be implemented on a separate chip.
  • Memory 22 is coupled to processor 21 for storing various software programs and/or sets of instructions.
  • memory 22 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
  • the memory 22 can store an operating system (hereinafter referred to as a system) such as an embedded operating system such as ANDROID, IOS, WINDOWS, or LINUX.
  • the memory 22 can also store a network communication program that can be used to communicate with one or more terminal devices.
  • the memory 22 can also store a user interface program, which can vividly display the content of an application through a graphical operation interface and receive the user's control operations on the application through input controls such as menus, dialog boxes, and buttons.
  • According to the speech object recognition model updating device provided by this embodiment of the present invention, the wake-up speech signal is filtered by combining speech matching with service-instruction execution, thereby improving the accuracy of voice object recognition.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division of units is merely a logical function division; in actual implementation there may be other division manners.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in or transmitted by a computer readable storage medium.
  • The computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner.
  • The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • The foregoing storage medium includes any medium that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)

Abstract

The present application discloses a voice object recognition method and apparatus. The method includes: receiving the voice of a voice object and acquiring the speech signal in the voice, where the speech signal includes a wake-up speech signal and the speech signal of a service instruction; matching the wake-up speech signal against a voice object recognition model; if the matching succeeds, executing the service instruction; and when the service instruction is executed successfully, if it is determined, according to the confidence factor corresponding to the wake-up speech signal and the score factor corresponding to the service instruction, that the wake-up speech signal is to be used as an additional training sample, training the voice object recognition model with the training sample. A corresponding apparatus is also disclosed. By combining speech matching with service-instruction execution to filter wake-up speech signals, the accuracy of voice object recognition is improved.

Description

语音对象识别方法及装置
本申请要求于2017年9月1日提交中国专利局、申请号为CN2017107808785、发明名称为“语音对象识别方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音识别技术领域,尤其涉及一种语音对象识别方法及装置。
背景技术
语音对象识别或称声纹识别是一种利用人的声音实现的识别技术,由于人在讲话时使用的发声器官存在一定的差异性,任何两个人声音的声纹图谱都有差异,所以声纹可以作为表征个体差异的生物特征,因此可以通过建立识别模型来表征不同的个体,进而利用该识别模型识别不同的个体。语音对象识别技术凭借其低成本、准确和方便等优势,已得到广泛应用。智能终端利用语音对象识别技术,用户进行语音注册后,其他用户就无法唤醒,可以保证用户的隐私。
目前语音对象识别应用中,根据注册语音建立的语音对象识别模型只根据语音对象有限的语料对语音对象语音进行建模,且一旦注册成功,语音对象识别模型就固定不变,语音对象识别的效果对用户注册的语音比较依赖。由于语音对象的语音既与语音对象主观因素如语速、情绪等有关,也与语音对象的身体健康状况等客观因素有关。随着时间的变化,语音对象的这些主、客观因素都会影响语音对象的发音。由于语音对象注册的语音对象识别模型的语料有限,并不能对语音对象语音模型进行充分建模,导致语音对象识别系统识别率难以突破。而声纹训练的语料越长,建立的特征模型当然越精确,识别准确率也就越高,但是这种模型建立的方式的实用性不强。
因此,需要提高语音对象识别的准确性。
发明内容
本申请提供一种语音对象识别方法及装置,以提高语音对象识别的准确性。
本申请的一方面,提供了一种语音对象识别方法,所述方法包括:接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号;将所述唤醒语音信号与语音对象识别模型进行匹配;若匹配成功,则执行所述业务指令;当所述业务指令执行成功时,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型。在该实现方式中,通过结合语音匹配和业务指令执行来对唤醒语音信号进行筛选,提高了语音对象识别的准确性。
在一种实现方式中,所述方法还包括:判断所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和是否大于或等于设定阈值;若所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和大于或等于设定阈值,则确定将所 述唤醒语音信号作为追加的训练样本。
在该实现方式中,通过具体的计算方式,可以准确地判断唤醒语音信号是否有效。
在另一种实现方式中,所述业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。在该实现方式中,业务的私密性越高,业务指令的得分因子越高;业务的历史应用频率越高,在唤醒终端时,即指示执行该业务,则业务指令的得分因子越高。
在又一种实现方式中,将所述有效的唤醒语音信号作为追加的训练样本之前,所述方法还包括:对所述有效的唤醒语音信号进行处理,其中,所述处理包括以下至少一种操作:降噪处理和去除静音段处理。在该实现方式中,在采用唤醒语音信号作为追加的训练样本前,对唤醒语音信号进行处理,可以提高语音对象识别模型更新的准确性。
在又一种实现方式中,所述采用所述训练样本训练所述语音对象识别模型之前,所述方法还包括:根据预设的训练样本建立所述语音对象识别模型。在该实现方式中,建立语音对象识别模型,以用于后续的语音对象识别。还可以将更新后的识别模型作为识别模型,重复上述步骤,不断地修正、更新识别模型,不断提高识别模型的精确度。
在又一种实现方式中,所述采用所述训练样本训练所述语音对象识别模型,包括:根据所述有效的唤醒语音信号以及所述预设的训练样本,生成修正语音对象识别模型;采用所述修正语音对象识别模型对所述语音对象识别模型进行更新。在该实现方式中,通过不断收集语音交互过程中的语料,能够尽量消除用户的各种语调、语速、情绪等因素对于识别模型精确度的偏移,将会大大减少语调、语速、情绪等因素对识别模型精确度的影响,提高语音对象识别的准确性。
本申请的另一方面,提供了一种语音对象识别装置,该语音对象识别装置具有实现上述方法中语音对象识别装置行为的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。
一种可能的实现方式中,所述语音对象识别装置包括:语音获取单元,用于接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号;唤醒单元,用于将所述获取单元获取的所述唤醒语音信号与语音对象识别模型进行匹配;执行单元,用于若所述匹配单元匹配成功,则执行所述业务指令;模型训练单元,用于当所述执行单元执行所述业务指令成功时,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型
另一种可能的实现方式中,所述语音对象识别装置包括:接收器、发射器、存储器和处理器;其中,所述存储器中存储一组程序代码,且所述处理器用于调用所述存储器中存储的程序代码,执行以下操作:接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号;将所述唤醒语音信号与语音对象识别模型进行匹配;若匹配成功,则执行所述业务指令;当所述业务指令执行成功时,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型。
进一步地,所述处理器还用于执行以下操作:判断所述唤醒语音信号的置信因子和所 述至少一个业务指令的得分因子的加权之和是否大于或等于设定阈值;若所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和大于或等于设定阈值,则确定将所述唤醒语音信号作为追加的训练样本。
进一步地,所述业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。
进一步地,所述处理器执行所述将所述有效的唤醒语音信号作为追加的训练样本的操作之前,还执行以下操作:对所述有效的唤醒语音信号进行处理,其中,所述处理包括以下至少一种操作:降噪处理和去除静音段处理。
进一步地,所述处理器执行所述采用所述训练样本训练所述语音对象识别模型的操作之前,还执行以下操作:根据预设的训练样本建立所述语音对象识别模型。
进一步地,所述处理器执行所述采用所述训练样本训练所述语音对象识别模型的操作,包括:根据所述有效的唤醒语音信号以及所述预设的训练样本,生成修正语音对象识别模型;采用所述修正语音对象识别模型对所述语音对象识别模型进行更新。
基于同一发明构思,由于该装置解决问题的原理以及有益效果可以参见上述各可能的语音对象识别装置的方法实施方式以及所带来的有益效果,因此该装置的实施可以参见方法的实施,重复之处不再赘述。
本申请的又一方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述各方面所述的方法。
本申请的又一方面提供了一种通信芯片,其中存储有指令,当其在网络设备或终端设备上运行时,使得计算机执行上述各方面所述的方法。
本申请的又一方面提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述各方面所述的方法。
附图说明
为了更清楚地说明本发明实施例或背景技术中的技术方案,下面将对本发明实施例或背景技术中所需要使用的附图进行说明。
图1为本发明实施例提供的一种语音对象识别方法的流程示意图;
图2为对图1所示的一种语音对象识别方法进一步细化的流程示意图;
图3为一种语音信号降噪处理流程示意图;
图4为一种语音激活检测流程示意图;
图5为本发明实施例提供的一种语音对象识别装置的结构示意图;
图6为本发明实施例提供的一种语音对象识别设备的结构示意图。
具体实施方式
下面结合本发明实施例中的附图对本发明实施例进行描述。
本申请的一种示例的应用场景为,用户想要用手机给朋友打电话,而目前手机是锁着的,需要解锁,而目前可以采用语音唤醒手机进行解锁。然而,现在进行语音解锁所使用的语料一般比较短,而且用户的语音随着时间的变化,用户的语速、情绪、身体健康状况 等都可能会影响用户的发音,因而,有可能非法获取该用户的手机的语音对象也可能误闯唤醒该手机,或者用户一段时间后,因为自己的发音跟注册语音不一致,导致不能唤醒手机。
本发明实施例提供一种语音对象识别方法及装置、设备,通过结合语音匹配和业务指令执行来对唤醒语音信号进行筛选,用筛选通过的语音信号更新语音对象识别模型,提高了语音对象识别的准确性。
请参阅图1,图1为本发明实施例提供的一种语音对象识别方法的流程示意图,该方法可包括以下步骤:
S101,接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号。
在这里,语音可以为语音对象通过智能终端进行语音聊天或者发出语音指令等产生的音频流,也可以为通过录音等方式获取的音频流等。具体可以是语音对象通过麦克风等语音输入设备输入或语音传感器检测到的音频流。本实施例中,则当语音输入设备接收到语音对象的语音时,获取该语音中的语音信号。需要说明的是,语音可以是一种模拟的语音信号,而本实施例获取的语音信号则是经过模数转换的数字语音信号或电信号。这里的语音对象可以是智能终端的合法持有人,也可以是任意一个持有该智能终端的用户,例如智能终端的非法持有人,或者合法持有人的家人等;该语音对象也可以是机器设备等。
而在该语音信号中包括两种语音信号,即包括唤醒语音信号和业务指令的语音信号。本实施例中的终端可以通过语音指令来指示终端执行业务。在本实施例中,业务指令语音信号可以是一个,也可以是多个,即语音对象可以同时发出多种业务指令。这里,唤醒语音信号应是与语音对象识别模型中的注册语音的语料一致的语音信号,而业务指令语音信号则是指示终端去执行业务指令的语音信号。例如,在上面的应用场景中,用户可以对着手机的麦克风说“你好,小易,请拨打朋友***的电话”,在这个语音信号中,“你好,小易”是唤醒语音信号,而“请拨打朋友***的电话”则是一个业务指令语音信号,即指示终端拨打朋友***的电话。
另外,是否启动声纹学习功能的开关,进行语音对象识别模型更新,用户可以根据需要自行设置。
可选地,在步骤S101之前,还可包括步骤:根据预设的训练样本建立语音对象识别模型。
语音对象识别模型为根据预设的语音信号流的训练样本预先建立的识别模型,即预先提供关联于预设的语音信号流的训练样本,并根据该训练样本训练形成语音对象识别模型。该语音对象识别模型为针对某一对象完成的声纹注册过程后形成的特征模型。且因为本发明实施例提供的方法可以实现对模型进行更新或修正的操作,因此,该语音对象识别模型可以为利用现有方法获取的识别模型,也可以为利用本发明实施例提供的方法进行修正后的识别模型。
S102,将所述唤醒语音信号与语音对象识别模型进行匹配,匹配是否成功?若匹配成功,则进行到步骤S103;否则,跳转到步骤S105,结束流程,终止训练语音对象识别模型。
由于预先建立了语音对象识别模型,因而可以根据语音对象的唤醒语音信号与该语音 对象识别模型的匹配度,来确定是否匹配成功。
具体地,调用声纹确认算法接口,获取该唤醒语音信号与该语音对象识别模型的匹配度。匹配度的计算方式可以为:将唤醒语音信号作为语音对象识别模型的输入值,则获取唤醒语音信号与语音对象识别模型对应的匹配度,或称为对应的概率。该匹配度或概率表示该唤醒语音信号与语音对象识别模型的相关度的大小。若计算得到的匹配度大于或等于预设的匹配度阈值,则认为该唤醒语音信号与语音信号识别模型匹配成功;否则,匹配失败。
若匹配成功,则唤醒了终端。
S103,执行所述业务指令,执行是否成功?若执行成功,则进行到步骤S104;否则,跳转到步骤S105,结束流程,终止训练语音对象识别模型。
终端唤醒后,可执行业务指令。语音信号中可包括一个或多个业务指令,终端可分别执行该一个或多个业务指令。该业务可以是一些预先指定的业务,该指定的业务与后续进行语音对象识别模型的训练有关。
判断业务指令是否执行成功,即判断指示的业务是否完成。例如,如果业务指令语音信号是“拨打朋友***的电话”,则执行成功是指在通讯录中查找到了朋友***的电话号码,并拨打该号码;又例如,如果业务指令语音信号是“播放音乐”,则执行成功是指打开音乐播放器,按照默认设置播放音乐播放器中的音乐。
需要说明的是,若该语音信号中包括多个业务指令的语音信号,则业务指令执行成功的判断是指分别判断各个业务指令是否执行成功。这里判断业务指令是否执行成功,是指判断与确定唤醒语音信号是否可以作为追加的训练样本有关的业务指令是否执行成功,当然,语音信号中也可以包括与确定唤醒语音信号是否可以作为追加的训练样本无关的业务指令本实施例中,若任意一个业务指令执行失败,则可丢弃该唤醒语音信号,终止语音对象识别模型的训练或更新。
S104,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型。
具体地,在步骤S104之前,还可以包括以下步骤:判断所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和是否大于或等于设定阈值;
若所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和大于或等于设定阈值,则确定将所述唤醒语音信号作为追加的训练样本。
若唤醒语音信号匹配成功,且业务指令执行成功,则确定唤醒语音信号是否可以作为追加的训练样本。具体地,唤醒语音信号在确定唤醒语音信号是否可以作为追加的训练样本的过程中具有一定的置信度,采用置信度或置信因子表示。该置信度取决于进行唤醒语音信号匹配时的语音对象识别得分,理论上,得分越高,说明唤醒语音是注册语音的语音对象(例如,终端的合法持有人)概率越大,相应的置信度或置信因子越大。现有技术中,一般若能唤醒终端,则必定认为该语音对象就是该终端的合法持有人,但本实施例中,确定唤醒语音信号是否可以作为追加的训练样本还需考虑业务指令的得分因子,以避免不是终端的合法持有人的非法闯入。置信因子和业务指令的得分因子可以是一个大于或等于0 且小于或等于1的数,则本实施例中,置信因子应小于1,因为还需考虑执行业务指令的得分因子。因为若语音对象唤醒终端后,还能指示终端执行某些业务,若该语音对象是该终端的合法持有人,则该语音对象必定对自己持有的终端中安装的业务或存储的内容比较熟悉,所以若指示的业务能执行成功,可以在进行唤醒语音信号的有效性判断时给予一定的得分因子。
业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。即,业务的私密性越好,则得分因子越高;反之,业务的私密性越差,则得分因子越低。同样地,业务的历史应用频率越高,则得分因子越高;反之,业务的历史应用频率越低,则得分因子越低。若某个业务指令执行失败,可以认为该业务指令的得分因子为0,或者进行有效性判断时,不考虑该业务指令。
因而,本实施例通过综合考虑唤醒语音信号的置信因子和业务指令的得分因子,可以提高确定唤醒语音信号是否可以作为追加的训练样本的准确性。
具体地,S104中,采用所述训练样本训练所述语音对象识别模型又可以包括以下步骤:
根据所述有效的唤醒语音信号以及所述预设的训练样本,生成修正语音对象识别模型;
采用所述修正语音对象识别模型对所述语音对象识别模型进行更新。
在确定唤醒语音信号可以作为追加的训练样本后,根据该唤醒语音信号以及预设的训练样本,调用声纹注册算法接口,生成修正语音对象识别模型。其中,该预设的训练样本也即为生成上述语音对象识别模型所使用的训练样本。上述修正语音对象识别模型则为更为精确的识别模型,利用该修正语音对象识别模型对上述语音对象识别模型进行更新(例如,将修正语音对象识别模型作为语音对象识别模型进行保存,以替换之前的语音对象识别模型),能够达到模型自适应与智能化的目的。具体地,将唤醒语音信号作为追加的训练样本,也即根据唤醒语音信号以及预设的训练样本,调用声纹注册算法接口,生成修正识别模型。
本实施例使用的语音对象识别模型训练算法可以是基于GMM-UBM的增量训练方法。其它训练方法诸如i-Vector、d-Vector等均可实现语音对象识别模型训练。
此外,还可以将更新后的识别模型作为识别模型,重复上述步骤,不断地修正、更新识别模型,不断提高识别模型的精确度。
由于用户在说话过程或者多人会话等过程中,一般会出现变化较大的语速、语调、情绪波动等,则通过不断收集语音交互过程中的语料,能够尽量消除用户的各种语调、语速、情绪等因素对于识别模型精确度的偏移,将会大大减少语调、语速、情绪等因素对识别模型精确度的影响,也能够降低对声纹识别准确度的影响。
根据本发明实施例提供的一种语音对象识别模型更新方法,通过结合语音匹配和业务指令执行来对唤醒语音进行筛选,提高了语音对象识别模型更新的准确性。
图2为对图1所示的一种语音对象识别模型更新方法进一步细化的流程示意图。以对手机进行唤醒和训练语音对象识别模型为例,该方法可包括以下步骤:
用户通过手机的麦克风输入语音,获取该语音中的语音信号。该语音信号包括唤醒语音信号和业务指令的语音信号。
S201,将唤醒语音信号与语音对象识别模型进行匹配。
S202,若匹配成功,则手机被唤醒,进行到步骤S203;否则,进行到步骤S207,终止训练语音对象识别模型。
该步骤与图1所示实施例的步骤S102相同,在此不再赘述。
S203,执行语音信号中的业务指令。
该步骤与图1所示实施例的步骤S103相同,在此不再赘述。
S204,判断所述唤醒语音信号的置信因子和所述业务指令的得分因子的加权之和是否大于或等于设定阈值,若是,则进行到步骤S205,否则,进行到步骤S207,终止训练语音对象识别模型。
对于执行成功的语音交互业务,将唤醒语音信号的置信因子进行加权,并将至少一个业务指令的得分因子进行加权,判断两者的加权之和是否大于或等于设定阈值。若两者的加权之和大于或等于设定阈值,则可认为该唤醒语音信号可作为语音对象识别模型的追加的训练样本,即训练该语音对象识别模型;否则,终止训练该语音对象识别模型,丢弃该唤醒语音信号。
具体地,可采用以下公式确定唤醒语音信号是否可以作为追加的训练样本:
w_s·α + ∑_{k=1}^{n} w_k·β_k ≥ Thd_1
其中,α为唤醒语音信号的置信度或置信因子,w s为唤醒语音的置信度权重;β k为第k个语音交互业务的得分因子,w k为第k个语音交互业务的权重,n为选取的手机业务总数,Thd 1为决策门限。
唤醒语音信号的置信度α,取决于唤醒语音信号匹配时的语音对象识别得分,理论上说,得分越高,说明唤醒语音信号是注册语音的语音对象概率越大,相应的置信度值α越大。
语音交互业务得分因子β,这里的业务指手机系统的固定业务,如打电话、发短信、发E-mail等,以及APP业务,如实时通信、第三方电子支付、手游等。
表1是对这些业务的得分因子的举例说明,得分越高,说明唤醒语音信号是注册语音的语音对象概率越大。
表1手机交互业务得分因子
Figure PCTCN2018085335-appb-000002
此外,w s、w k和Thd 1都为常数,在实际应用中可以对这些常数进行调节,以使得系统性能最优。
具体地,业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。关于业务的私密性与业务的得分因子的关系,例如,对于表1中的打电话的业务,因为需要查找通讯录,如果语音对象是终端的合法持有人,则一般清楚自己的通讯录中存储有电话收听方的号码,若能成功地拨打对方电话,则该得分因子可以设置得比较高;而对于表1中的播放音乐的业务,因为一般的终端中都安装了播放APP,因此,语音对象指示播放音乐,该业务的得分因子可以相对设置得较低。关于业务的历史应用频率与业务的得分因子的关系,例如,终端的合法持有人使用微信的频率最高,因而可设置微信对应的得分因子最高,若语音对象指示唤醒终端后立即执行打开微信的操作,则该语音对象是注册语音的语音对象的概率就较高,该唤醒语音信号有效的概率就越高。当然,业务的得分因子的设置可以是用户自行设置的,或者是系统根据统计的数据设置的。
从而,业务指令的得分因子可以与业务的私密性等级X k有关,则业务指令的得分因子为β k(X k);也可以与业务的历史应用频率Y k有关,则业务指令的得分因子为β k(X k)(Y k);若业务指令的得分因子与X k、Y k均相关,则业务指令的得分因子为β k(X k)(Y k)。
此外,在判断唤醒语音信号是否有效时,还可考虑背景环境嘈杂程度,理想的用于训练语音对象识别模型的语料应该为安静环境语料,本实施例采用SNR对语料环境进行判断,即当SNR值大于或等于门限值Thd2时,认为语音对象唤醒的环境为安静环境。
从图2及前面的描述可以看出,虚线框内是确定唤醒语音信号是否可以作为追加的训练样本。
S205,对所述有效的唤醒语音信号进行处理,其中,所述处理包括以下至少一种操作:降噪处理和去除静音段处理。
对有效的唤醒语音信号进行处理,可以为后面的语音模型训练做准备。该处理包括:降噪处理和语音激活检测等。降噪处理是为了抑制语音信号中的噪声信号,提高语音质量;语音激活检测是为了去除语音信号的静音段,保留有效的语音信号,保证模型训练的输入只含有语音信号。
例如,如图3所示,给出了一种语音信号降噪处理流程示意图,可采用基于单麦克风的降噪方法,该方法主要通过计算麦克信号各频段语音存在概率,并根据语音存在概率大小对各个频点施加不同增益从而达到降噪目的。实现语音降噪还有基于多麦克风的阵列降噪算法、自适应滤波、信号子空间、神经网络等降噪算法,这些算法都可以实现语音降噪的目的。
如图4所示,给出了一种语音信号VAD检测方法,该方法是按帧处理的,根据一帧声音信号的平均幅度来判断该帧信号是否属于语音段,对于该帧信号的幅度和大于该帧的最小幅度的,则保留该帧语音信号;否则,丢弃该帧语音信号。其中,Frame_Len为帧长,Eng为该帧信号的幅度和,MIN_Amp为语音信号采样点最小幅度。
S206,进行模型增量训练,获得新的语音对象识别模型。
该步骤与图1所示实施例的步骤S105相同,在此不再赘述。
根据本发明实施例提供的一种语音对象识别模型更新方法,通过结合语音匹配和业务 指令执行来对唤醒语音信号进行筛选,提高了语音对象识别的准确性;且在采用唤醒语音信号作为追加的训练样本前,对唤醒语音信号进行处理,可以提高语音对象识别模型的准确性。
上述详细阐述了本发明实施例的方法,下面提供了本发明实施例的装置。
请参阅图5,图5为本发明实施例提供的一种语音对象识别模型更新装置的结构示意图,该装置可以包括:语音获取单元(未示出)、唤醒单元12、执行单元(未示出)和模型训练单元11;进一步地,还可以包括训练样本追加单元13和唤醒语音处理单元14。该装置可以是终端设备,也可以是终端设备中的一个处理部件,例如图形处理器(graphic processing unit,GPU)、图像处理器(image processing unit,IPU)等。若该装置为终端设备中的一个处理部件,则该装置接收的是匹配完成的唤醒语音信号,语音信号的获取、唤醒语音信号的匹配以及业务指令的执行都由终端设备的其他部件完成,该装置只进行语音对象识别模型的训练或更新。
语音获取单元,用于接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号;
唤醒单元12,用于将所述获取单元获取的所述唤醒语音信号与语音对象识别模型进行匹配;
执行单元,用于若所述匹配单元匹配成功,则执行至少一个业务指令;
模型训练单元11,用于当所述执行单元执行所述业务指令成功时,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型。
在一种实现方式中,训练样本追加单元13,用于判断所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和是否大于或等于设定阈值;若所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和大于或等于设定阈值,则确定将所述唤醒语音信号作为追加的训练样本
其中,业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。
在另一种实现方式中,唤醒语音处理单元14,用于对所述有效的唤醒语音信号进行处理,其中,所述处理包括以下至少一种操作:降噪处理和去除静音段处理。
在又一种实现方式中,所述模型训练单元11还用于根据预设的训练样本建立所述语音对象识别模型。
在又一种实现方式中,所述模型训练单元11具体用于:
根据所述有效的唤醒语音信号以及所述预设的训练样本,生成修正语音对象识别模型;
采用所述修正语音对象识别模型对所述语音对象识别模型进行更新。
根据本发明实施例提供的一种语音对象识别模型更新装置,通过结合语音匹配和业务指令执行来对唤醒语音信号进行筛选,提高了语音对象识别的准确性。
请参阅图6,图6为本发明实施例提供的一种语音对象识别模型更新设备的结构示意图,该设备200可以包括:处理器21、存储器22(一个或多个计算机可读存储介质)、通信模块23(可选)、输入输出系统24。这些部件可在一个或多个通信总线25上通信。
输入输出系统24主要用于实现语音对象识别设备200和用户/外部环境之间的交互功能,主要包括语音对象识别设备200的输入输出装置。具体实现中,输入输出系统24可包括触摸屏控制器241、音频控制器242以及传感器控制器243。其中,各个控制器可与各自对应的外围设备(触摸屏244、音频电路245以及传感器246耦合。具体实现中,音频电路245,例如可以是麦克风,可接收用户或外部环境的语音。传感器246可采集用户或外部环境的语音)。音频控制器242以及传感器控制器243分别获取接收或采集到的语音中的语音信号。需要说明的,输入输出系统24还可以包括其他I/O外设。
在一种实现方式中,处理模块21包括一个或多个处理器(central processing unit,CPU)211、一个或多个图像处理器或图形处理器212、以及一个或多个数字信号处理器(digital signal proccesing,DSP)213。每个处理器可集成包括:一个或多个处理模块、时钟模块以及电源管理模块。所述时钟模块主要用于为处理器产生数据传输和时序控制所需要的时钟。所述电源管理模块主要用于为处理模块21、通信模块23以及输入输出系统24等提供稳定的、高精确度的电压。具体地,DSP213用于将获取的唤醒语音信号与语音对象识别模型进行匹配,例如执行上述实施例中S102或S202的步骤;IPU或GPU212用于确定唤醒语音信号作为追加的训练样本,采用该训练样本训练语音对象识别模型,例如执行上述实施例中S104或S204、S206的步骤,以及用于对有效的唤醒语音信号进行处理,例如执行上述实施例中S205的步骤;CPU211用于协调存储器22、IPU或GPU212、DSP213、通信模块23以及输入输出系统24的工作,例如用于执行业务指示,例如执行上述实施例中S103或S203的步骤。
在另一种替换的实现方式中,处理模块21也可以仅包括一个或多个CPU,以上处理模块的所有操作都由这一个或多个CPU来完成。
通信模块23用于接收和发送无线信号,主要集成了语音对象识别设备200的接收器和发射器。具体实现中,通信模块23可包括但不限于:Wi-Fi模块、蓝牙模块。Wi-Fi模块、蓝牙模块可分别用于与其他通信设备建立Wi-Fi、蓝牙等通信连接,以实现近距离的数据通信。在一些实施例中,可在单独的芯片上实现通信模块23。
存储器22与处理器21耦合,用于存储各种软件程序和/或多组指令。具体实现中,存储器22可包括高速随机存取的存储器,并且也可包括非易失性存储器,例如一个或多个磁盘存储设备、闪存设备或其他非易失性固态存储设备。存储器22可以存储操作系统(下述简称系统),例如ANDROID,IOS,WINDOWS,或者LINUX等嵌入式操作系统。存储器22还可以存储网络通信程序,该网络通信程序可用于与一个或多个终端设备进行通信。存储器22还可以存储用户接口程序,该用户接口程序可以通过图形化的操作界面,将应用程序的内容形象逼真的显示出来,并通过菜单、对话框以及按键等输入控件接收用户对应用程序的控制操作。
根据本发明实施例提供的一种语音对象识别模型更新设备,通过结合语音匹配和业务指令执行来对唤醒语音信号进行筛选,提高了语音对象识别的准确性。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可 以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者通过所述计算机可读存储介质进行传输。所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,数字通用光盘(digital versatile disc,DVD))、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,该流程可以由计算机程序来指令相关的硬件完成,该程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:只读存储器(read-only memory,ROM)或随机存储存储器(random access memory,RAM)、磁碟或者光盘等各种可存储程序代码的介质。

Claims (12)

  1. 一种语音对象识别方法,其特征在于,所述方法包括:
    接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号;
    将所述唤醒语音信号与语音对象识别模型进行匹配;
    若匹配成功,则执行所述业务指令;
    当所述业务指令执行成功时,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型。
  2. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    判断所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和是否大于或等于设定阈值;
    若所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和大于或等于设定阈值,则确定将所述唤醒语音信号作为追加的训练样本。
  3. 如权利要求1或2所述的方法,其特征在于,所述业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。
  4. 如权利要求1~3任一项所述的方法,其特征在于,将所述有效的唤醒语音信号作为追加的训练样本之前,所述方法还包括:
    对所述有效的唤醒语音信号进行处理,其中,所述处理包括以下至少一种操作:降噪处理和去除静音段处理。
  5. 如权利要求1~4任一项所述的方法,其特征在于,所述采用所述训练样本训练所述语音对象识别模型之前,所述方法还包括:
    根据预设的训练样本建立所述语音对象识别模型。
  6. 如权利要求5所述的方法,其特征在于,所述采用所述训练样本训练所述语音对象识别模型,包括:
    根据所述有效的唤醒语音信号以及所述预设的训练样本,生成修正语音对象识别模型;
    采用所述修正语音对象识别模型对所述语音对象识别模型进行更新。
  7. 一种语音对象识别装置,其特征在于,所述装置包括:
    语音获取单元,用于接收语音对象的语音,获取所述语音中的语音信号,其中,所述语音信号包括唤醒语音信号和业务指令的语音信号;
    唤醒单元,用于将所述获取单元获取的所述唤醒语音信号与语音对象识别模型进行匹配;
    执行单元,用于若所述匹配单元匹配成功,则执行所述业务指令;
    模型训练单元,用于当所述执行单元执行所述业务指令成功时,若根据所述唤醒语音信号对应的置信因子和所述业务指令对应的得分因子,确定将所述唤醒语音信号作为追加的训练样本,则采用所述训练样本训练所述语音对象识别模型。
  8. 如权利要求7所述的装置,其特征在于,所述装置还包括:
    训练样本追加单元,用于判断所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和是否大于或等于设定阈值;若所述唤醒语音信号的置信因子和所述至少一个业务指令的得分因子的加权之和大于或等于设定阈值,则确定将所述唤醒语音信号作为追加的训练样本。
  9. 如权利要求7或8所述的装置,其特征在于,所述业务指令的得分因子与以下至少一个参数有关:业务的私密性、业务的历史应用频率。
  10. 如权利要求7~9任一项所述的装置,其特征在于,所述装置还包括:
    唤醒语音信号处理单元,用于对所述有效的唤醒语音信号进行处理,其中,所述处理包括以下至少一种操作:降噪处理和去除静音段处理。
  11. 如权利要求7~10任一项所述的装置,其特征在于,所述模型训练单元还用于根据预设的训练样本建立所述语音对象识别模型。
  12. 如权利要求11所述的装置,其特征在于,所述模型训练单元具体用于:
    根据所述有效的唤醒语音信号以及所述预设的训练样本,生成修正语音对象识别模型;
    采用所述修正语音对象识别模型对所述语音对象识别模型进行更新。
PCT/CN2018/085335 2017-09-01 2018-05-02 语音对象识别方法及装置 WO2019041871A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710780878.5A CN109427336B (zh) 2017-09-01 2017-09-01 语音对象识别方法及装置
CN201710780878.5 2017-09-01

Publications (1)

Publication Number Publication Date
WO2019041871A1 true WO2019041871A1 (zh) 2019-03-07

Family

ID=65512952

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/085335 WO2019041871A1 (zh) 2017-09-01 2018-05-02 语音对象识别方法及装置

Country Status (2)

Country Link
CN (1) CN109427336B (zh)
WO (1) WO2019041871A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288997B (zh) * 2019-07-22 2021-04-16 苏州思必驰信息科技有限公司 用于声学组网的设备唤醒方法及系统
CN110706707B (zh) * 2019-11-13 2020-09-18 百度在线网络技术(北京)有限公司 用于语音交互的方法、装置、设备和计算机可读存储介质
CN112489648B (zh) * 2020-11-25 2024-03-19 广东美的制冷设备有限公司 唤醒处理阈值调整方法、语音家电、存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760434A (zh) * 2012-07-09 2012-10-31 华为终端有限公司 一种声纹特征模型更新方法及终端
CN105976813A (zh) * 2015-03-13 2016-09-28 三星电子株式会社 语音识别系统及其语音识别方法
US9679569B2 (en) * 2014-06-24 2017-06-13 Google Inc. Dynamic threshold for speaker verification
CN106887231A (zh) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 一种识别模型更新方法及系统以及智能终端

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760434A (zh) * 2012-07-09 2012-10-31 华为终端有限公司 一种声纹特征模型更新方法及终端
US9679569B2 (en) * 2014-06-24 2017-06-13 Google Inc. Dynamic threshold for speaker verification
CN105976813A (zh) * 2015-03-13 2016-09-28 三星电子株式会社 语音识别系统及其语音识别方法
CN106887231A (zh) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 一种识别模型更新方法及系统以及智能终端

Also Published As

Publication number Publication date
CN109427336B (zh) 2020-06-16
CN109427336A (zh) 2019-03-05

Similar Documents

Publication Publication Date Title
CN110310623B (zh) 样本生成方法、模型训练方法、装置、介质及电子设备
US11322157B2 (en) Voice user interface
EP3210205B1 (en) Sound sample verification for generating sound detection model
WO2021139327A1 (zh) 一种音频信号处理方法、模型训练方法以及相关装置
WO2017031846A1 (zh) 噪声消除、语音识别方法、装置、设备及非易失性计算机存储介质
TW201905675A (zh) 資料更新方法、客戶端及電子設備
WO2014114048A1 (zh) 一种语音识别的方法、装置
WO2014114049A1 (zh) 一种语音识别的方法、装置
CN110600048B (zh) 音频校验方法、装置、存储介质及电子设备
US11626104B2 (en) User speech profile management
WO2019041871A1 (zh) 语音对象识别方法及装置
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
US20230005480A1 (en) Voice Filtering Other Speakers From Calls And Audio Messages
KR20200025226A (ko) 전자 장치 및 그 제어 방법
CN108847243B (zh) 声纹特征更新方法、装置、存储介质及电子设备
KR20200007530A (ko) 사용자 음성 입력 처리 방법 및 이를 지원하는 전자 장치
CN110689887A (zh) 音频校验方法、装置、存储介质及电子设备
WO2021169711A1 (zh) 指令执行方法、装置、存储介质及电子设备
CN112951243A (zh) 语音唤醒方法、装置、芯片、电子设备及存储介质
US10818298B2 (en) Audio processing
CN113241059B (zh) 语音唤醒方法、装置、设备及存储介质
CN110197663A (zh) 一种控制方法、装置及电子设备
CN114121022A (zh) 语音唤醒方法、装置、电子设备以及存储介质
CN114495981A (zh) 语音端点的判定方法、装置、设备、存储介质及产品
CN112509556B (zh) 一种语音唤醒方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18850353

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18850353

Country of ref document: EP

Kind code of ref document: A1