WO2019210796A1 - Voice recognition method, apparatus, storage medium and electronic device - Google Patents

Voice recognition method, apparatus, storage medium and electronic device

Info

Publication number
WO2019210796A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
training
voice
recording
pronunciation
Application number
PCT/CN2019/084131
Other languages
English (en)
French (fr)
Inventor
陈岩
刘耀勇
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司
Publication of WO2019210796A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Definitions

  • The present application relates to the field of mobile communication technologies, and in particular to mobile device technologies, and specifically to a voice recognition method, apparatus, storage medium, and electronic device.
  • The embodiments of the present application provide a voice recognition method, apparatus, storage medium, and electronic device, which can recognize real-person pronunciation, prevent others from breaching security by using recordings or synthesized voices, and improve security.
  • An embodiment of the present application provides a voice recognition method, applied to an electronic device, the method including:
  • Acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
  • Extracting feature information from the training sample;
  • Inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
  • Generating a living body detection model according to the optimized parameters;
  • When a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result;
  • Determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • the embodiment of the present application further provides a voice recognition device, where the device includes:
  • An obtaining module configured to acquire a training sample, where the training sample includes a real human pronunciation sample and a non-real human pronunciation sample;
  • An extraction module configured to extract feature information in the training sample
  • a training module configured to input the training sample and the feature information into the reference model as training data, to obtain an optimized parameter of the reference model after training;
  • A generating module configured to generate a living body detection model according to the optimized parameters;
  • a detecting module configured to perform a living body detection on the test voice by using the living body detection model to generate a prediction result when the test voice is received;
  • an identifying module configured to determine, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • An embodiment of the present application further provides a storage medium on which a computer program is stored; when the computer program runs on a computer, it causes the computer to execute the voice recognition method described above.
  • An embodiment of the present application further provides an electronic device, including a memory and a processor, where the processor is configured to perform the following steps by calling a computer program stored in the memory:
  • Acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
  • Extracting feature information from the training sample;
  • Inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
  • Generating a living body detection model according to the optimized parameters;
  • When a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result;
  • Determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • FIG. 1 is a schematic diagram of a system of a voice recognition apparatus according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario of a voice recognition device according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of a voice recognition method according to an embodiment of the present disclosure.
  • FIG. 4 is another schematic flowchart of a voice recognition method according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present application.
  • FIG. 6 is another schematic structural diagram of a voice recognition apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • In the prior art, when user identity authentication such as waking up or unlocking an electronic device such as a smartphone is performed, the electronic device usually provides a voiceprint recognition algorithm. However, voiceprint recognition has certain security problems in protecting against recording playback and speech synthesis attacks.
  • the electronic device may be a smart phone, a tablet computer, a desktop computer, a notebook computer, or a handheld computer.
  • FIG. 1 is a schematic diagram of a system for a voice recognition apparatus according to an embodiment of the present application.
  • The speech recognition apparatus is mainly configured to: acquire training samples in advance and extract feature information from the training samples; input the training samples and the feature information as training data into a reference model for training, so as to obtain optimized parameters of the trained reference model; generate a living body detection model according to the optimized parameters; and, when a test voice is received, perform living body detection on the test voice by the living body detection model to generate a prediction result and determine, according to the prediction result, whether to perform voiceprint recognition on the test voice. In this way, real-person pronunciation can be accurately recognized, preventing others from breaching security with recordings or synthesized voices and improving the security of the device.
  • FIG. 2 is a schematic diagram of an application scenario of a voice recognition device according to an embodiment of the present application.
  • For example, when receiving a test voice input by a user, the voice recognition device inputs the feature information of the test voice into the living body detection model for living body detection to generate a prediction result. If the prediction result is real-person pronunciation, voiceprint recognition is performed on the test voice to implement identity authentication of the user; when the identity authentication passes, the electronic device is unlocked, as shown by state B in FIG. 2, and when the identity authentication fails, a failure prompt is given and the locked state is maintained. If the prediction result is non-real-person pronunciation, authentication of the test voice is prohibited, and a voice prompt or text prompt such as "non-real-person pronunciation, authentication prohibited" may be issued, as shown by state C in FIG. 2.
  • An execution body of a voice recognition method provided by an embodiment of the present application may be a voice recognition device provided by an embodiment of the present application, or an electronic device integrated with the voice recognition device (such as a palmtop computer, a tablet computer, or a smart phone).
  • the speech recognition device can be implemented in hardware or software.
  • An embodiment of the present invention provides a voice recognition method, including:
  • Acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
  • Extracting feature information from the training sample;
  • Inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
  • Generating a living body detection model according to the optimized parameters;
  • When a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result;
  • Determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • The step of determining, according to the prediction result, whether to perform voiceprint recognition on the test voice may include: if the prediction result is real-person pronunciation, determining to perform voiceprint recognition on the test voice to implement identity authentication of the user; or, if the prediction result is non-real-person pronunciation, determining not to perform voiceprint recognition on the test voice.
  • The step of acquiring a training sample may include: collecting real-person pronunciation and marking it as the real human pronunciation sample; and collecting non-real-person pronunciation and marking it as the non-real human pronunciation sample, where the non-real human pronunciation sample includes a real-person recording subsample and a synthesized voice recording subsample.
  • The step of collecting non-real-person pronunciation and marking it as the non-real human pronunciation sample may include: recording the real human pronunciation sample and marking the recording as the real-person recording subsample in the non-real human pronunciation sample; and recording synthesized voice pronunciation and marking the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
  • The step of extracting feature information from the training sample may include: separately extracting the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample.
  • The step of inputting the training sample and the feature information as training data into the reference model for training to obtain the optimized parameters of the trained reference model may include: inputting the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
  • The step of obtaining the optimized parameters of the trained reference model may include: starting timing; acquiring a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter, and acquiring a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
  • the step of generating a living body detection model according to the optimization parameter may include: generating a living body detection model according to the first optimization parameter and the second optimization parameter.
  • The step of obtaining the optimized parameters of the trained reference model may include: inputting the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample in the training sample as training data into the convolutional layers to obtain a first intermediate value; inputting the first intermediate value into the fully connected layers to obtain a second intermediate value; inputting the second intermediate value into the classifier to obtain probabilities corresponding to a plurality of prediction results; obtaining a loss value according to the plurality of prediction results and the plurality of probabilities corresponding thereto; and performing training according to the loss value to obtain the optimized parameters.
  • FIG. 3 to FIG. 4 are schematic flowcharts of a voice recognition method according to an embodiment of the present application.
  • the method is applied to an electronic device, the method comprising:
  • Step 101 Acquire a training sample, where the training sample includes a real human pronunciation sample and a non-real human pronunciation sample.
  • step 101 can be implemented by step 1011 and step 1012, specifically:
  • Step 1011: Collect real-person pronunciation and label it as the real human pronunciation sample.
  • Step 1012: Collect non-real-person pronunciation and label it as the non-real human pronunciation sample, where the non-real human pronunciation sample includes a real-person recording subsample and a synthesized voice recording subsample.
  • In some embodiments, collecting non-real-person pronunciation and labeling it as the non-real human pronunciation sample includes:
  • Recording the real human pronunciation sample and labeling the recording as the real-person recording subsample in the non-real human pronunciation sample;
  • Recording synthesized voice pronunciation and labeling the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
  • For example, a recording device such as a microphone in an electronic device such as a mobile phone first collects the real-person pronunciation input by the user and labels it as a real human pronunciation sample, and then records the played-back real human pronunciation sample or the synthesized voice pronunciation and labels the recording as a non-real human pronunciation sample.
  • The training sample may be a sample set M, and the sample set M includes a plurality of sample groups m. To improve the accuracy of model training, training samples whose voice content is close to one another may be selected for training; for example, each sample group may include a real human pronunciation sample, a real-person recording subsample, and a synthesized voice recording subsample that share the same voice content.
  • For example, the training sample includes a sample set M, where M includes a plurality of sample groups {m1, m2, m3, ..., mn}, and the first sample group m1 includes {x1, y1, z1}, where x1 denotes the real human pronunciation sample input by the user with the voice content "the weather is nice today", y1 denotes the real-person recording subsample obtained by playing back that sample on the electronic device and re-recording it with the recording device, and z1 denotes the synthesized voice recording subsample with the voice content "the weather is nice today".
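  • For illustration only, one way the sample set M described above might be organized in code is sketched below; the class name, field names, and file paths are hypothetical and are not part of the application.

    # Hedged sketch (assumed structure): each sample group m holds three recordings
    # that share the same spoken content, labelled real / replayed / synthesized.
    from dataclasses import dataclass

    @dataclass
    class SampleGroup:
        x: str     # path to the real human pronunciation sample
        y: str     # path to the real-person recording subsample (played back and re-recorded)
        z: str     # path to the synthesized voice recording subsample
        text: str  # shared voice content of the group

    M = [
        SampleGroup("m1_real.wav", "m1_replay.wav", "m1_synth.wav", "the weather is nice today"),
        # ... further groups m2, m3, ..., mn
    ]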
  • Step 102 Extract feature information in the training sample.
  • Every voice carries unique feature information, and this feature information can effectively distinguish different people's voices.
  • It should be noted that this unique feature information is mainly determined by two factors. The first is the size of the vocal cavity, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same thing, the frequency distribution of their voices differs: some sound deep and some sound resonant. Since everyone's vocal cavity is different, each person's voice, much like a fingerprint, carries unique feature information.
  • The second factor that determines the feature information of a voice is the way the vocal organs are manipulated. The vocal organs include the lips, teeth, tongue, soft palate, and palatal muscles; their interaction produces clear speech, and the way they cooperate is learned by each person, somewhat at random, through communication with the people around them. In learning to speak, a person gradually forms his or her own voiceprint feature information by imitating the speaking styles of different people nearby.
  • For example, the wavelength, frequency, intensity, rhythm, and timbre of a sound, or the frequency, phase, and amplitude in a spectrogram, can all reflect the differences between different voices. However, for a group of real-person and non-real-person pronunciations with the same voice content, the difference is not easy to distinguish by the human ear or by a voiceprint recognition system, even though difference parameters must exist in certain feature values. To find the feature values that differ between real-person pronunciation and real-person recordings or synthesized voices, and thus effectively identify whether a voice is real-person pronunciation, a large number of training samples needs to be acquired for training.
  • In some embodiments, the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample may be separately extracted, and the spectrograms are used as the feature information corresponding to the training sample.
  • For example, each training sample is converted into a corresponding spectrogram, and the spectrogram is used to represent the feature information of that training sample.
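  • The application does not prescribe how the spectrograms are computed; as one possible illustration, a short-time Fourier transform can be used. The sketch below assumes mono WAV input and uses SciPy; the window and hop sizes are illustrative assumptions.

    # Illustrative spectrogram extraction (assumed parameters).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    def wav_to_log_spectrogram(path, nperseg=400, noverlap=240):
        """Convert a mono WAV file into a 2-D log-magnitude spectrogram."""
        sample_rate, samples = wavfile.read(path)
        freqs, times, sxx = spectrogram(samples.astype(np.float32),
                                        fs=sample_rate,
                                        nperseg=nperseg,    # ~25 ms window at 16 kHz
                                        noverlap=noverlap)  # ~10 ms hop at 16 kHz
        return np.log(sxx + 1e-10)                          # log compression for stability

    # Usage: spec = wav_to_log_spectrogram("m1_real.wav")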
  • Step 103: Input the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model.
  • In some embodiments, the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample are input as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
  • For example, a convolutional neural network model may be selected as the reference model. Of course, the reference model may also be a hidden Markov model, a Gaussian mixture model, or the like.
  • The convolutional neural network model includes convolutional layers, fully connected layers, and a classifier connected in sequence.
  • Specifically, the convolutional neural network mainly includes a network structure part and a network training part, where the network structure part includes the convolutional layers and the fully connected layers connected in sequence. An activation layer and a pooling layer may also be included between the convolutional layers and the fully connected layers.
  • Optionally, the network structure part of the convolutional neural network model may include a five-layer network: the first three layers are convolutional layers with a uniform kernel size of 3×3 and a uniform stride of 1; because the input dimensions are small, no pooling layer is used; and the last two layers are fully connected layers with 20 neurons and 2 neurons, respectively.
  • It should be noted that the network structure part may also use a different number of convolutional layers, such as 3, 7, or 9 convolutional layers, and a different number of fully connected layers, such as 1 or 3 fully connected layers. A pooling layer may be added, or no pooling layer may be used.
  • The convolution kernel size may also take other values, such as 2×2, and different convolutional layers may use kernels of different sizes; for example, the first convolutional layer may use a 3×3 kernel while the other convolutional layers use 2×2 kernels.
  • The stride may be uniformly set to 2 or another value, or different strides may be used, for example a stride of 2 for the first layer and a stride of 1 for the other layers.
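  • As a hedged illustration of the five-layer structure described above (three 3×3 convolutional layers with stride 1, no pooling, then fully connected layers of 20 and 2 neurons), one possible PyTorch sketch follows; the channel counts and input size are assumptions, since the application does not specify them.

    # Minimal sketch of the described network (assumed channel counts and input size).
    import torch
    import torch.nn as nn

    class LivenessCNN(nn.Module):
        def __init__(self, in_height=128, in_width=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * in_height * in_width, 20), nn.ReLU(),  # 20 neurons
                nn.Linear(20, 2),  # 2 neurons: real-person vs non-real-person pronunciation
            )

        def forward(self, spectrogram):                 # spectrogram: (batch, 1, H, W)
            return self.classifier(self.features(spectrogram))  # second intermediate value (logits)

    model = LivenessCNN()
    logits = model(torch.randn(4, 1, 128, 128))         # e.g. a batch of 4 spectrograms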
  • the training method can include the following steps:
  • (1) Input the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample in the training sample as training data into the convolutional layers to obtain a first intermediate value.
  • (2) Input the first intermediate value into the fully connected layers to obtain a second intermediate value.
  • (3) Input the second intermediate value into the classifier to obtain probabilities corresponding to a plurality of prediction results. The probabilities may be obtained by inputting the second intermediate value into the classifier based on a first preset formula (the softmax function):
  • p_k = exp(z_k) / Σ_{j=1}^{C} exp(z_j)
  • where z_k is the target second intermediate value, C is the number of categories of the prediction result, and z_j is the j-th second intermediate value.
  • (4) Obtain a loss value according to the plurality of prediction results and the plurality of probabilities corresponding thereto.
  • The loss value may be obtained from the plurality of prediction results and the corresponding probabilities based on a second preset formula (the cross-entropy loss):
  • Loss = -Σ_{k=1}^{C} y_k log(p_k)
  • where C is the number of categories of the prediction result and y_k is the true value.
  • (5) Perform training according to the loss value to obtain the optimized parameters.
  • Training may be performed according to the loss value using the stochastic gradient descent method; training may also be performed using the batch gradient descent method or the ordinary gradient descent method.
  • When the stochastic gradient descent method is used, training may be considered complete when the loss value is equal to or less than a preset loss value. Training may also be considered complete when two or more consecutively obtained loss values no longer change. Of course, it is also possible to directly set the number of iterations of the stochastic gradient descent method without relying on the loss value, and training is complete once that number of iterations has been run. After training is completed, the parameters of the reference model at that point are obtained and saved as the optimized parameters, and these optimized parameters are then used for subsequent prediction.
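  • To make steps (1) to (5) concrete, a hedged training-loop sketch is given below, reusing the LivenessCNN sketch above. The learning rate, iteration limit, and loss threshold are assumptions, and PyTorch's CrossEntropyLoss (which applies softmax internally) stands in for the first and second preset formulas.

    # Hedged sketch of steps (1)-(5): forward pass through the convolutional and fully
    # connected layers, softmax + cross-entropy loss, stochastic gradient descent, and
    # a loss-threshold stopping rule. Hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    def train(model, loader, max_iters=10000, loss_threshold=0.05):
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
        criterion = nn.CrossEntropyLoss()   # softmax + cross-entropy, averaged over the batch
        iteration = 0
        while iteration < max_iters:
            for spectrograms, labels in loader:   # labels: 0 = real person, 1 = non-real person
                logits = model(spectrograms)      # steps (1)-(2): first and second intermediate values
                loss = criterion(logits, labels)  # steps (3)-(4): probabilities and loss value
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                  # step (5): update parameters from the loss
                iteration += 1
                if loss.item() <= loss_threshold or iteration >= max_iters:
                    return model.state_dict()     # the optimized parameters
        return model.state_dict()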
  • Further, the loss value may be obtained from multiple groups of parameters based on a third preset formula, where each group of parameters includes a plurality of prediction results and the plurality of probabilities corresponding thereto. The third preset formula averages the cross-entropy loss over the groups:
  • Loss = E[-Σ_{k=1}^{C} y_k log(p_k)]
  • where C is the number of categories of the prediction result, y_k is the true value, and E denotes the average.
  • The optimized parameters may be obtained by training in a mini-batch manner; for example, if the batch size is 128, E in the third preset formula denotes the average of 128 loss values.
  • Further, multiple sample groups may be acquired first and converted into multiple two-dimensional spectrograms, and the multiple spectrograms are then input as training data into the reference model to obtain multiple loss values, after which the average of the multiple loss values is taken.
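  • As an illustration of the mini-batch training just described, the following hedged sketch turns the sample groups of M into labeled two-dimensional spectrograms and feeds them in batches of 128; the batch size is taken from the example in the text, while the label assignment and the fixed spectrogram size are assumptions.

    # Hedged sketch: build (spectrogram, label) pairs from the sample set M and feed
    # them to the training loop in mini-batches. Assumes the spectrograms have been
    # cropped or padded to a fixed size so they can be stacked into a batch.
    import torch
    from torch.utils.data import Dataset, DataLoader

    class LivenessDataset(Dataset):
        def __init__(self, groups):                  # groups: list of SampleGroup
            self.items = []
            for g in groups:
                self.items.append((g.x, 0))          # 0 = real-person pronunciation
                self.items.append((g.y, 1))          # 1 = non-real person (replayed recording)
                self.items.append((g.z, 1))          # 1 = non-real person (synthesized voice)

        def __len__(self):
            return len(self.items)

        def __getitem__(self, idx):
            path, label = self.items[idx]
            spec = wav_to_log_spectrogram(path)      # 2-D spectrogram from the earlier sketch
            return torch.as_tensor(spec, dtype=torch.float32)[None], label  # (1, H, W)

    loader = DataLoader(LivenessDataset(M), batch_size=128, shuffle=True)
    optimized_parameters = train(LivenessCNN(), loader)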
  • The optimized parameters are used to represent the difference feature values between real-person pronunciation and non-real-person pronunciation, and these parameters can effectively distinguish real-person pronunciation from non-real-person pronunciation.
  • In some embodiments, obtaining the optimized parameters of the trained reference model includes: acquiring a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter, and acquiring a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
  • The first optimization parameter can effectively distinguish real-person pronunciation from real-person recordings, and the second optimization parameter can effectively distinguish real-person pronunciation from synthesized voice recordings.
  • When acquiring the difference feature values between real-person pronunciation and real-person recordings, the differences in the low-frequency voice signal and/or the high-frequency voice signal are more obvious. For example, the real-person recording signal in the low-frequency band is missing some information compared with real-person pronunciation; if the waveform of the audio signal in the real-person pronunciation signal is attenuated at a certain phase, the degree of waveform attenuation at that phase can be regarded as a difference feature value between the real human pronunciation sample and the real-person recording subsample.
  • Accordingly, when the reference model is trained with the training samples, more high-frequency or low-frequency samples may be selected for training to obtain better optimized parameters. That is, by inputting the training samples into the reference model, the model itself finds the difference feature values between real-person pronunciation and non-real-person pronunciation through continuous deep learning and training, and the optimized parameters are obtained through the deep learning of the reference model; from input to output, no manual participation is required, and the work is done by a reference model capable of deep learning.
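  • Purely as an illustration of what such a difference feature value could look like if it were computed by hand, the sketch below compares low-frequency energy between a real pronunciation and its replayed recording; the cutoff frequency and the ratio measure are assumptions, and in the application the reference model learns such differences automatically.

    # Illustrative hand-crafted "difference feature value" (not the application's method:
    # there, the deep-learning model finds the differences by itself).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import welch

    def low_freq_energy(path, cutoff_hz=300):
        rate, samples = wavfile.read(path)
        freqs, psd = welch(samples.astype(np.float32), fs=rate, nperseg=1024)
        return psd[freqs <= cutoff_hz].sum()

    def difference_feature(real_path, replay_path):
        # Replayed recordings often lose low-band detail relative to live speech.
        return low_freq_energy(real_path) / (low_freq_energy(replay_path) + 1e-12)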
  • Step 104 Generate a living body detection model according to the optimization parameter.
  • The living body detection model relies on the optimized parameters obtained by the above training, and by using the optimized parameters the living body detection model can effectively detect real-person pronunciation and non-real-person pronunciation.
  • In some embodiments, the living body detection model is generated according to the first optimization parameter and the second optimization parameter. The model relies on the optimized parameters obtained by the above training: by using the first optimization parameter it can effectively distinguish real-person pronunciation from real-person recordings, and by using the second optimization parameter it can effectively distinguish real-person pronunciation from synthesized voice recordings.
  • Step 105: When a test voice is received, perform living body detection on the test voice by using the living body detection model to generate a prediction result.
  • When a test voice is received, the test voice may be real-person pronunciation, or it may be non-real-person pronunciation such as a recording. At this point, living body detection is performed on the test voice by the living body detection model; the detection combines the feature information of the test voice with the optimized parameters in the living body detection model, so a prediction result with higher accuracy can be generated.
  • The prediction result may include two categories: real-person pronunciation and non-real-person pronunciation. The prediction result may also include three categories: real-person pronunciation, real-person recording, and synthesized voice.
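  • A possible inference sketch follows, reusing the assumed LivenessCNN and spectrogram helper from above; the two-class setup and the way the prediction result is reported are illustrative choices.

    # Hedged inference sketch: run the living body detection model on a test voice
    # and map the softmax probabilities to a prediction result.
    import torch
    import torch.nn.functional as F

    CLASSES = ["real_person", "non_real_person"]     # a three-class variant is also possible

    def predict(model, spectrogram):
        """spectrogram: 2-D array (resized to the model's input size); returns (label, probability)."""
        model.eval()
        with torch.no_grad():
            x = torch.as_tensor(spectrogram, dtype=torch.float32)[None, None]  # (1, 1, H, W)
            probs = F.softmax(model(x), dim=1)[0]    # the first preset formula (softmax)
            k = int(torch.argmax(probs))
        return CLASSES[k], float(probs[k])

    # Example: label, p = predict(model, wav_to_log_spectrogram("test_voice.wav"))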
  • Step 106 Determine, according to the prediction result, whether voiceprint recognition is performed on the test voice.
  • If the prediction result is real-person pronunciation, it is determined that voiceprint recognition is performed on the test voice to implement identity authentication of the user; or, if the prediction result is non-real-person pronunciation, it is determined that voiceprint recognition is not performed on the test voice.
  • For example, if the prediction result is real-person pronunciation, the test voice is input into the voiceprint recognition system for voiceprint recognition to implement identity authentication of the user, for example by matching the test voice against the preset user's voiceprint template stored in the voiceprint recognition library. If the match succeeds, the identity authentication passes; if the match fails, the identity authentication does not pass. When the identity authentication passes, the electronic device is unlocked or woken up; when it does not pass, an authentication-failure prompt may be given and the locked state maintained, or the wake-up operation is not responded to.
  • If the prediction result is non-real-person pronunciation, authentication of the test voice is prohibited, and a voice prompt or text prompt may be issued to remind the user that the test voice is non-real-person pronunciation and that there may be a security risk.
  • For example, when the prediction result is non-real-person pronunciation, a prompt message may also be sent to other user equipment or a user mailbox bound to the current device, to notify the user that the current device is currently being illegally authenticated by someone else.
  • For example, when the prediction result is non-real-person pronunciation, the current device may also enter a self-protection mode. The self-protection mode may include changing the unlocking manner, for example changing from voiceprint unlocking to a combination of voiceprint recognition and face recognition, to increase the difficulty of unlocking. The self-protection mode may include starting an automatic shutdown function. The self-protection mode may also include automatically hiding private information in the current device, for example hiding folders marked as private, hiding applications with payment or financial management functions, or hiding instant messaging applications, which better protects the user's information security.
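  • A hedged sketch of the overall decision logic of steps 105 and 106 as described above; the device and voiceprint-system method names are hypothetical placeholders, not APIs defined by the application.

    # Illustrative decision flow (function names are hypothetical).
    def handle_test_voice(test_voice_path, model, voiceprint_system, device):
        label, prob = predict(model, wav_to_log_spectrogram(test_voice_path))
        if label == "real_person":
            # Only real-person pronunciation proceeds to voiceprint recognition.
            if voiceprint_system.match(test_voice_path):
                device.unlock()
            else:
                device.prompt("Authentication failed")
        else:
            # Non-real-person pronunciation: refuse authentication and self-protect.
            device.prompt("Non-real-person pronunciation, authentication prohibited")
            device.notify_bound_accounts("Possible illegal authentication attempt")
            device.enter_self_protection_mode()  # e.g. stricter unlocking, hide private data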
  • the training process of the reference model can be on the server side or on the electronic device side.
  • When both the training process and the actual prediction process of the reference model are completed on the server side, and the living body detection model generated from the optimized reference model needs to be used, the test voice and the feature information corresponding to the test voice can be sent to the server; after the server completes the actual prediction, it sends the prediction result to the electronic device, and the electronic device decides, according to the prediction result, whether to proceed to the next step of identity authentication.
  • When both the training process and the actual prediction process of the reference model are completed on the electronic device side, and the living body detection model generated from the optimized reference model needs to be used, the test voice and the corresponding feature information can be input to the electronic device; after the electronic device completes the actual prediction, it decides, according to the prediction result, whether to proceed to the next step of identity authentication.
  • When the training process of the reference model is completed on the server side and the actual prediction process is completed on the electronic device side, and the living body detection model generated from the optimized reference model needs to be used, the test voice and the corresponding feature information can be input to the electronic device; after the electronic device completes the actual prediction, it decides, according to the prediction result, whether to proceed to the next step of identity authentication. Optionally, the trained living body detection model file (model file) can be ported to the smart device; if living body detection needs to be performed on an input test voice, the test voice is input to the trained living body detection model file (model file), and the prediction result is obtained by computation.
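  • One possible way, under the PyTorch assumption used in the sketches above, to save the trained model as a single file that can be ported to the device and loaded for prediction; TorchScript is used purely as an illustration, since the application does not name a serialization format.

    # Hedged sketch of porting the trained living body detection model file to a device.
    import torch

    # Server side: serialize the trained model into a single model file.
    scripted = torch.jit.script(model)               # model: the trained LivenessCNN
    scripted.save("liveness_model.pt")

    # Device side: load the model file and run living body detection on a test voice.
    device_model = torch.jit.load("liveness_model.pt")
    label, prob = predict(device_model, wav_to_log_spectrogram("test_voice.wav"))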
  • In the embodiments of the present application, a training sample is acquired, the training sample including a real human pronunciation sample and a non-real human pronunciation sample; feature information is extracted from the training sample; the training sample and the feature information are input as training data into a reference model for training to obtain optimized parameters of the trained reference model; a living body detection model is generated according to the optimized parameters; when a test voice is received, living body detection is performed on the test voice by the living body detection model to generate a prediction result; and whether to perform voiceprint recognition on the test voice is determined according to the prediction result.
  • By performing model training with labeled real human pronunciation samples and non-real human pronunciation samples, and by integrating the optimized parameters obtained from the model into the voiceprint recognition system for voiceprint recognition, the embodiments of the present application can accurately recognize real-person pronunciation, prevent others from breaching security with recordings or synthesized voices, and improve the security of the device.
  • An embodiment of the present invention provides a voice recognition apparatus, including:
  • An obtaining module configured to acquire a training sample, where the training sample includes a real human pronunciation sample and a non-real human pronunciation sample;
  • An extraction module configured to extract feature information in the training sample
  • a training module configured to input the training sample and the feature information into the reference model as training data, to obtain an optimized parameter of the reference model after training;
  • A generating module configured to generate a living body detection model according to the optimized parameters;
  • a detecting module configured to perform a living body detection on the test voice by using the living body detection model to generate a prediction result when the test voice is received;
  • an identifying module configured to determine, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • The identifying module is configured to: if the prediction result is real-person pronunciation, determine to perform voiceprint recognition on the test voice to implement identity authentication of the user; or, if the prediction result is non-real-person pronunciation, determine not to perform voiceprint recognition on the test voice.
  • The acquiring module may include: a first collecting submodule configured to collect real-person pronunciation and mark it as the real human pronunciation sample; and a second collecting submodule configured to collect non-real-person pronunciation and mark it as the non-real human pronunciation sample.
  • The second collecting submodule is configured to: record the real human pronunciation sample and mark the recording as the real-person recording subsample in the non-real human pronunciation sample; and record synthesized voice pronunciation and mark the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
  • The extracting module is configured to separately extract the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample.
  • The training module is configured to input the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
  • The training module is further configured to: acquire a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter; and acquire a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
  • the generating module is further configured to generate a living body detection model according to the first optimization parameter and the second optimization parameter.
  • FIG. 5 to FIG. 6 are schematic structural diagrams of a voice recognition device according to an embodiment of the present application.
  • The voice recognition device 30 includes an obtaining module 31, an extracting module 32, a training module 33, a generating module 34, a detecting module 35, and an identifying module 36.
  • the obtaining module 31 is configured to acquire a training sample, where the training sample includes a real human pronunciation sample and a non-real human pronunciation sample.
  • the obtaining module 31 further includes a first collecting sub-module 311 and a second collecting sub-module 312 .
  • The first collecting submodule 311 is configured to collect real-person pronunciation and mark it as the real human pronunciation sample.
  • The second collecting submodule 312 is configured to collect non-real-person pronunciation and mark it as the non-real human pronunciation sample.
  • The second collecting submodule 312 is configured to record the real human pronunciation sample and mark the recording as the real-person recording subsample in the non-real human pronunciation sample, and to record synthesized voice pronunciation and mark the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
  • the extracting module 32 is configured to extract feature information in the training sample.
  • In some embodiments, the extracting module 32 is configured to separately extract the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample.
  • the training module 33 is configured to input the training sample and the feature information as training data into a reference model for training to obtain optimized parameters of the reference model after training.
  • In some embodiments, the training module 33 is configured to input the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
  • In some embodiments, the training module 33 is further configured to acquire a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter, and to acquire a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
  • the generating module 34 is configured to generate a living body detection model according to the optimization parameter.
  • the generating module 34 is further configured to generate a living body detection model according to the first optimization parameter and the second optimization parameter.
  • the detecting module 35 is configured to perform a living body detection on the test voice by using the living body detection model to generate a prediction result when the test voice is received.
  • the identifying module 36 is configured to determine, according to the prediction result, whether voiceprint recognition is performed on the test voice.
  • The identifying module 36 is configured to: if the prediction result is real-person pronunciation, determine to perform voiceprint recognition on the test voice to implement identity authentication of the user; or, if the prediction result is non-real-person pronunciation, determine not to perform voiceprint recognition on the test voice.
  • In the embodiments of the present application, the obtaining module 31 acquires a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample; the extracting module 32 extracts feature information from the training sample; the training module 33 inputs the training sample and the feature information as training data into a reference model for training to obtain optimized parameters of the trained reference model; the generating module 34 generates a living body detection model according to the optimized parameters; when a test voice is received, the detecting module 35 performs living body detection on the test voice by the living body detection model to generate a prediction result; and the identifying module 36 determines, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • The speech recognition apparatus 30 of the embodiments of the present application performs model training with labeled real human pronunciation samples and non-real human pronunciation samples, and integrates the optimized parameters obtained from the model into the voiceprint recognition system for voiceprint recognition; it can thus accurately recognize real-person pronunciation, prevent others from breaching security with recordings or synthesized voices, and improve the security of the device.
  • An embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor calls the computer program stored in the memory to execute the voice recognition method described in any of the embodiments of the present application.
  • the electronic device can be a device such as a smart phone, a tablet computer, or a palmtop computer.
  • As shown in FIG. 7, the electronic device 400 includes a processor 401 having one or more processing cores, a memory 402 having one or more computer-readable storage media, and a computer program stored in the memory and executable on the processor. The processor 401 is electrically connected to the memory 402. It will be understood by those skilled in the art that the electronic device structure shown in the drawings does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
  • The processor 401 is the control center of the electronic device 400. It connects the various parts of the entire electronic device using various interfaces and lines, and by running or loading applications stored in the memory 402 and calling data stored in the memory 402, it executes the various functions of the electronic device and processes data, thereby performing overall monitoring of the electronic device.
  • In the embodiments of the present application, the processor 401 in the electronic device 400 loads instructions corresponding to the processes of one or more applications into the memory 402 according to the following steps, and the processor 401 runs the applications stored in the memory 402 to implement various functions:
  • Acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
  • Extracting feature information from the training sample;
  • Inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
  • Generating a living body detection model according to the optimized parameters;
  • When a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result;
  • Determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
  • In some embodiments, when determining, according to the prediction result, whether to perform voiceprint recognition on the test voice, the processor 401 is configured to:
  • If the prediction result is real-person pronunciation, determine to perform voiceprint recognition on the test voice to implement identity authentication of the user; or
  • If the prediction result is non-real-person pronunciation, determine not to perform voiceprint recognition on the test voice.
  • In some embodiments, when acquiring the training sample, the processor 401 is configured to:
  • Collect real-person pronunciation and mark it as the real human pronunciation sample; and collect non-real-person pronunciation and mark it as the non-real human pronunciation sample, where the non-real human pronunciation sample includes a real-person recording subsample and a synthesized voice recording subsample.
  • In some embodiments, when collecting non-real-person pronunciation and marking it as the non-real human pronunciation sample, the processor 401 is configured to:
  • Record the real human pronunciation sample and mark the recording as the real-person recording subsample in the non-real human pronunciation sample; and record synthesized voice pronunciation and mark the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
  • In some embodiments, when extracting the feature information from the training sample, the processor 401 is configured to: separately extract the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample.
  • When inputting the training sample and the feature information as training data into the reference model for training to obtain the optimized parameters of the trained reference model, the processor 401 is configured to:
  • Input the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
  • In some embodiments, when obtaining the optimized parameters of the trained reference model, the processor 401 is configured to: acquire a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter, and acquire a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
  • When generating the living body detection model according to the optimized parameters, the processor 401 is configured to: generate the living body detection model according to the first optimization parameter and the second optimization parameter.
  • In some embodiments, when obtaining the optimized parameters of the trained reference model, the processor 401 is configured to:
  • Input the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample as training data into the convolutional layers to obtain a first intermediate value; input the first intermediate value into the fully connected layers to obtain a second intermediate value; input the second intermediate value into the classifier to obtain probabilities corresponding to a plurality of prediction results; obtain a loss value according to the plurality of prediction results and the plurality of probabilities corresponding thereto; and perform training according to the loss value to obtain the optimized parameters.
  • the electronic device 400 further includes a display screen 403, a microphone 404, an audio circuit 405, an input unit 406, and a radio frequency circuit 407.
  • the processor 401 is electrically connected to the display screen 403, the microphone 404, the audio circuit 405, the input unit 406, and the RF circuit 407, respectively.
  • the electronic device structure illustrated in FIG. 8 does not constitute a limitation to the electronic device, and may include more or less components than those illustrated, or a combination of certain components, or different component arrangements.
  • the display screen 403 can be used to display information entered by the user or information provided to the user as well as various graphical user interfaces of the electronic device, which can be composed of graphics, text, icons, video, and any combination thereof.
  • The display screen 403 may also implement an input function as part of the input unit.
  • the microphone 404 can be used to convert a sound signal into an electrical signal to effect recording or input of a sound signal or the like. For example, the user's test voice and the like can be recorded through the microphone 404.
  • the audio circuit 405 can be used to provide an audio interface between the user and the electronic device through the speaker and the microphone.
  • the input unit 406 can be configured to receive input digits, character information, or user characteristic information (eg, fingerprints), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function controls.
  • The radio frequency circuit 407 can be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with them.
  • the electronic device 400 may further include a camera, a sensor, a wireless fidelity module, a Bluetooth module, a power supply, and the like, and details are not described herein.
  • the voice recognition device belongs to the same concept as the voice recognition method in the foregoing embodiment, and any method provided in the voice recognition method embodiment may be run on the voice recognition device.
  • the specific implementation process is described in the embodiment of the voice recognition method, and details are not described herein again.
  • the embodiment of the present application further provides a storage medium storing a computer program, when the computer program is run on a computer, causing the computer to execute the voice recognition method in any of the above embodiments.
  • The computer program may be stored in a computer-readable storage medium, such as the memory of the electronic device, and executed by at least one processor in the electronic device; the execution process may include the flow of the embodiments of the voice recognition method described above.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory (ROM), a random access memory (RAM), or the like.
  • each functional module may be integrated into one processing chip, or each module may exist physically separately, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as separate products, may also be stored in a computer readable storage medium such as a read only memory, a magnetic disk or an optical disk.

Abstract

A voice recognition method includes: acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample (101); extracting feature information from the training sample (102); inputting the training sample and the feature information as training data into a reference model for training to obtain optimized parameters (103); generating a living body detection model according to the optimized parameters (104); when a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result (105); and determining, according to the prediction result, whether to perform voiceprint recognition on the test voice (106).

Description

Voice recognition method, apparatus, storage medium and electronic device
This application claims priority to Chinese Patent Application No. 201810411000.9, filed with the Chinese Patent Office on May 2, 2018 and entitled "Voice recognition method, apparatus, storage medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of mobile communication technologies, and in particular to mobile device technologies, and specifically to a voice recognition method, apparatus, storage medium, and electronic device.
Background
With the development of electronic technologies and the popularization of smart electronic devices, information security issues have become particularly prominent. When user identity authentication such as waking up or unlocking an electronic device such as a smartphone is performed, the electronic device usually provides a voiceprint recognition algorithm; however, voiceprint recognition has certain problems in protecting against security breaches by recording playback and speech synthesis.
Summary
Embodiments of the present application provide a voice recognition method, apparatus, storage medium, and electronic device, which can recognize real-person pronunciation, prevent others from breaching security by using recordings or synthesized voices, and improve security.
In a first aspect, an embodiment of the present application provides a voice recognition method, applied to an electronic device, the method including:
acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
extracting feature information from the training sample;
inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
generating a living body detection model according to the optimized parameters;
when a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result; and
determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
In a second aspect, an embodiment of the present application further provides a voice recognition apparatus, the apparatus including:
an obtaining module configured to acquire a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
an extracting module configured to extract feature information from the training sample;
a training module configured to input the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
a generating module configured to generate a living body detection model according to the optimized parameters;
a detecting module configured to, when a test voice is received, perform living body detection on the test voice by the living body detection model to generate a prediction result; and
an identifying module configured to determine, according to the prediction result, whether to perform voiceprint recognition on the test voice.
In a third aspect, an embodiment of the present application further provides a storage medium on which a computer program is stored; when the computer program runs on a computer, it causes the computer to execute the voice recognition method described above.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the processor is configured to perform the following steps by calling a computer program stored in the memory:
acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
extracting feature information from the training sample;
inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
generating a living body detection model according to the optimized parameters;
when a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result; and
determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
Brief Description of the Drawings
The technical solutions and other beneficial effects of the present application will become apparent from the following detailed description of specific embodiments of the present application in conjunction with the accompanying drawings.
FIG. 1 is a schematic system diagram of a voice recognition apparatus according to an embodiment of the present application.
FIG. 2 is a schematic diagram of an application scenario of a voice recognition apparatus according to an embodiment of the present application.
FIG. 3 is a schematic flowchart of a voice recognition method according to an embodiment of the present application.
FIG. 4 is another schematic flowchart of a voice recognition method according to an embodiment of the present application.
FIG. 5 is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present application.
FIG. 6 is another schematic structural diagram of a voice recognition apparatus according to an embodiment of the present application.
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
FIG. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit it. In addition, for ease of description, the drawings show only the parts related to the present application rather than the entire structure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
In the prior art, when user identity authentication such as waking up or unlocking an electronic device such as a smartphone is performed, the electronic device usually provides a voiceprint recognition algorithm; however, voiceprint recognition has certain problems in protecting against security breaches by recording playback and speech synthesis. The electronic device may be a smartphone, a tablet computer, a desktop computer, a notebook computer, a handheld computer, or the like.
Referring to FIG. 1, FIG. 1 is a schematic system diagram of a voice recognition apparatus according to an embodiment of the present application. The voice recognition apparatus is mainly configured to: acquire training samples in advance and extract feature information from the training samples; input the training samples and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model; generate a living body detection model according to the optimized parameters; and, when a test voice is received, perform living body detection on the test voice by the living body detection model to generate a prediction result and determine, according to the prediction result, whether to perform voiceprint recognition on the test voice. In this way, real-person pronunciation can be accurately recognized, preventing others from breaching security with recordings or synthesized voices and improving the security of the device.
Specifically, referring to FIG. 2, FIG. 2 is a schematic diagram of an application scenario of a voice recognition apparatus according to an embodiment of the present application. For example, when receiving a test voice input by a user, the voice recognition apparatus inputs the feature information of the test voice into the living body detection model for living body detection to generate a prediction result. If the prediction result is real-person pronunciation, voiceprint recognition is performed on the test voice to implement identity authentication of the user; when the identity authentication passes, the electronic device is unlocked, as shown by state B in FIG. 2, and when the identity authentication fails, a failure prompt is given and the locked state is maintained. If the prediction result is non-real-person pronunciation, authentication of the test voice is prohibited, and a voice prompt or text prompt such as "non-real-person pronunciation, authentication prohibited" may be issued, as shown by state C in FIG. 2.
The voice recognition method provided by the embodiments of the present application may be executed by the voice recognition apparatus provided by the embodiments of the present application, or by an electronic device integrated with the voice recognition apparatus (for example, a handheld computer, a tablet computer, or a smartphone); the voice recognition apparatus may be implemented in hardware or software.
An embodiment of the present invention provides a voice recognition method, including:
acquiring a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample;
extracting feature information from the training sample;
inputting the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model;
generating a living body detection model according to the optimized parameters;
when a test voice is received, performing living body detection on the test voice by the living body detection model to generate a prediction result; and
determining, according to the prediction result, whether to perform voiceprint recognition on the test voice.
In one implementation, the step of determining, according to the prediction result, whether to perform voiceprint recognition on the test voice may include: if the prediction result is real-person pronunciation, determining to perform voiceprint recognition on the test voice to implement identity authentication of the user; or, if the prediction result is non-real-person pronunciation, determining not to perform voiceprint recognition on the test voice.
In one implementation, the step of acquiring a training sample may include: collecting real-person pronunciation and marking it as the real human pronunciation sample; and collecting non-real-person pronunciation and marking it as the non-real human pronunciation sample, where the non-real human pronunciation sample includes a real-person recording subsample and a synthesized voice recording subsample.
In one implementation, the step of collecting non-real-person pronunciation and marking it as the non-real human pronunciation sample may include: recording the real human pronunciation sample and marking the recording as the real-person recording subsample in the non-real human pronunciation sample; and recording synthesized voice pronunciation and marking the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
In one implementation, the step of extracting feature information from the training sample may include: separately extracting the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample.
In one implementation, the step of inputting the training sample and the feature information as training data into the reference model for training to obtain the optimized parameters of the trained reference model may include: inputting the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
In one implementation, the step of obtaining the optimized parameters of the trained reference model may include: starting timing; acquiring a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter, and acquiring a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
In one implementation, the step of generating a living body detection model according to the optimized parameters may include: generating the living body detection model according to the first optimization parameter and the second optimization parameter.
In one implementation, the step of obtaining the optimized parameters of the trained reference model may include: inputting the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample in the training sample as training data into the convolutional layers to obtain a first intermediate value; inputting the first intermediate value into the fully connected layers to obtain a second intermediate value; inputting the second intermediate value into the classifier to obtain probabilities corresponding to a plurality of prediction results; obtaining a loss value according to the plurality of prediction results and the plurality of probabilities corresponding thereto; and performing training according to the loss value to obtain the optimized parameters.
Referring to FIG. 3 and FIG. 4, both are schematic flowcharts of a voice recognition method according to an embodiment of the present application. The method is applied to an electronic device and includes:
Step 101: Acquire a training sample, the training sample including a real human pronunciation sample and a non-real human pronunciation sample.
In some embodiments, as shown in FIG. 4, step 101 may be implemented by step 1011 and step 1012, specifically:
Step 1011: Collect real-person pronunciation and label it as the real human pronunciation sample.
Step 1012: Collect non-real-person pronunciation and label it as the non-real human pronunciation sample, where the non-real human pronunciation sample includes a real-person recording subsample and a synthesized voice recording subsample.
In some embodiments, collecting non-real-person pronunciation and labeling it as the non-real human pronunciation sample includes:
recording the real human pronunciation sample and labeling the recording as the real-person recording subsample in the non-real human pronunciation sample; and
recording synthesized voice pronunciation and labeling the recording as the synthesized voice recording subsample in the non-real human pronunciation sample.
For example, a recording device such as a microphone in an electronic device such as a mobile phone first collects the real-person pronunciation input by the user and labels it as a real human pronunciation sample, and then records the played-back real human pronunciation sample or the synthesized voice pronunciation and labels the recording as a non-real human pronunciation sample.
The training sample may be a sample set M, and the sample set M includes a plurality of sample groups m. To improve the accuracy of model training, training samples whose voice content is close to one another may be selected for training; for example, each sample group may include a real human pronunciation sample, a real-person recording subsample, and a synthesized voice recording subsample that share the same voice content. For example, the training sample includes a sample set M, where M includes a plurality of sample groups {m1, m2, m3, ..., mn}, and the first sample group m1 includes {x1, y1, z1}, where x1 denotes the real human pronunciation sample input by the user with the voice content "the weather is nice today", y1 denotes the real-person recording subsample obtained by playing back that sample on the electronic device and re-recording it with the recording device, and z1 denotes the synthesized voice recording subsample with the voice content "the weather is nice today".
Step 102: Extract feature information from the training sample.
Every voice carries unique feature information, and this feature information can effectively distinguish different people's voices.
It should be noted that this unique feature information is mainly determined by two factors. The first is the size of the vocal cavity, including the throat, nasal cavity, and oral cavity; the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same thing, the frequency distribution of their voices differs: some sound deep and some sound resonant. Since everyone's vocal cavity is different, each person's voice, much like a fingerprint, carries unique feature information. The second factor that determines the feature information of a voice is the way the vocal organs are manipulated. The vocal organs include the lips, teeth, tongue, soft palate, and palatal muscles; their interaction produces clear speech, and the way they cooperate is learned by each person, somewhat at random, through communication with the people around them. In learning to speak, a person gradually forms his or her own voiceprint feature information by imitating the speaking styles of different people nearby. For example, the wavelength, frequency, intensity, rhythm, and timbre of a sound, or the frequency, phase, and amplitude in a spectrogram, can all reflect the differences between different voices.
However, for a group of real-person and non-real-person pronunciations with the same voice content, the difference is not easy to distinguish by the human ear or by a voiceprint recognition system, even though difference parameters must exist in certain feature values between real-person pronunciation and non-real-person pronunciation. To find the feature values that differ between real-person pronunciation and real-person recordings or synthesized voices, and thus effectively identify whether a voice is real-person pronunciation, a large number of training samples needs to be acquired for training.
In some embodiments, the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample may be separately extracted, and the spectrograms are used as the feature information corresponding to the training sample.
For example, each training sample is converted into a corresponding spectrogram, and the spectrogram is used to represent the feature information of that training sample.
Step 103: Input the training sample and the feature information as training data into a reference model for training, to obtain optimized parameters of the trained reference model.
In some embodiments, the spectrograms respectively corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample are input as training data into the reference model for training, to obtain the optimized parameters of the trained reference model.
For example, a convolutional neural network model may be selected as the reference model; of course, the reference model may also be a hidden Markov model, a Gaussian mixture model, or the like.
The convolutional neural network model includes convolutional layers, fully connected layers, and a classifier connected in sequence. Specifically, the convolutional neural network mainly includes a network structure part and a network training part, where the network structure part includes the convolutional layers and the fully connected layers connected in sequence; an activation layer and a pooling layer may also be included between the convolutional layers and the fully connected layers.
Optionally, the network structure part of the convolutional neural network model may include a five-layer network: the first three layers are convolutional layers with a uniform kernel size of 3×3 and a uniform stride of 1; because the input dimensions are small, no pooling layer is used; and the last two layers are fully connected layers with 20 neurons and 2 neurons, respectively.
It should be noted that the network structure part may also use a different number of convolutional layers, such as 3, 7, or 9 convolutional layers, and a different number of fully connected layers, such as 1 or 3 fully connected layers; a pooling layer may be added or omitted. The convolution kernel size may also take other values, such as 2×2, and different convolutional layers may use kernels of different sizes, for example a 3×3 kernel for the first convolutional layer and 2×2 kernels for the other convolutional layers. The stride may be uniformly set to 2 or another value, or different strides may be used, for example a stride of 2 for the first layer and a stride of 1 for the other layers.
For example, the training method may include the following steps:
(1) Input the spectrograms corresponding to the real human pronunciation sample, the real-person recording subsample, and the synthesized voice recording subsample in the training sample as training data into the convolutional layers to obtain a first intermediate value.
(2) Input the first intermediate value into the fully connected layers to obtain a second intermediate value.
(3) Input the second intermediate value into the classifier to obtain probabilities corresponding to a plurality of prediction results.
The probabilities may be obtained by inputting the second intermediate value into the classifier based on a first preset formula (the softmax function):
p_k = exp(z_k) / Σ_{j=1}^{C} exp(z_j)
where z_k is the target second intermediate value, C is the number of categories of the prediction result, and z_j is the j-th second intermediate value.
(4) Obtain a loss value according to the plurality of prediction results and the plurality of probabilities corresponding thereto.
The loss value may be obtained from the plurality of prediction results and the corresponding probabilities based on a second preset formula (the cross-entropy loss):
Loss = -Σ_{k=1}^{C} y_k log(p_k)
where C is the number of categories of the prediction result and y_k is the true value.
(5) Perform training according to the loss value to obtain the optimized parameters.
Training may be performed according to the loss value using the stochastic gradient descent method; training may also be performed using the batch gradient descent method or the ordinary gradient descent method.
When the stochastic gradient descent method is used, training may be considered complete when the loss value is equal to or less than a preset loss value, or when two or more consecutively obtained loss values no longer change. Of course, it is also possible to directly set the number of iterations of the stochastic gradient descent method without relying on the loss value, and training is complete once that number of iterations has been run. After training is completed, the parameters of the reference model at that point are obtained and saved as the optimized parameters, which are used for subsequent prediction.
Further, the loss value may be obtained from multiple groups of parameters based on a third preset formula, where each group of parameters includes a plurality of prediction results and the plurality of probabilities corresponding thereto; the third preset formula averages the cross-entropy loss over the groups:
Loss = E[-Σ_{k=1}^{C} y_k log(p_k)]
where C is the number of categories of the prediction result, y_k is the true value, and E denotes the average.
The optimized parameters may be obtained by training in a mini-batch manner; for example, if the batch size is 128, E in the third preset formula denotes the average of 128 loss values.
Further, multiple sample groups may be acquired first and converted into multiple two-dimensional spectrograms, and the multiple spectrograms are then input as training data into the reference model to obtain multiple loss values, after which the average of the multiple loss values is taken.
The optimized parameters are used to represent the difference feature values between real-person pronunciation and non-real-person pronunciation, and these parameters can effectively distinguish real-person pronunciation from non-real-person pronunciation.
In some embodiments, obtaining the optimized parameters of the trained reference model includes:
acquiring a difference feature value between the real human pronunciation sample and the real-person recording subsample to obtain a first optimization parameter, and acquiring a difference feature value between the real human pronunciation sample and the synthesized voice recording subsample to obtain a second optimization parameter.
The first optimization parameter can effectively distinguish real-person pronunciation from real-person recordings, and the second optimization parameter can effectively distinguish real-person pronunciation from synthesized voice recordings.
When acquiring the difference feature values between real-person pronunciation and real-person recordings, the differences in the low-frequency voice signal and/or the high-frequency voice signal are more obvious. For example, the real-person recording signal in the low-frequency band is missing some information compared with real-person pronunciation; if the waveform of the audio signal in the real-person pronunciation signal is attenuated at a certain phase, the degree of waveform attenuation at that phase can be regarded as a difference feature value between the real human pronunciation sample and the real-person recording subsample. Accordingly, when the reference model is trained with the training samples, more high-frequency or low-frequency samples may be selected for training to obtain better optimized parameters; that is, by inputting the training samples into the reference model, the model itself finds the difference feature values between real-person pronunciation and non-real-person pronunciation through continuous deep learning and training, and the optimized parameters are obtained through the deep learning of the reference model, so that from input to output no manual participation is required and the work is done by a reference model capable of deep learning.
Step 104: Generate a living body detection model according to the optimized parameters.
The living body detection model relies on the optimized parameters obtained by the above training, and by using the optimized parameters the living body detection model can effectively detect real-person pronunciation and non-real-person pronunciation.
In some embodiments, the living body detection model is generated according to the first optimization parameter and the second optimization parameter. The model relies on the optimized parameters obtained by the above training: by using the first optimization parameter it can effectively distinguish real-person pronunciation from real-person recordings, and by using the second optimization parameter it can effectively distinguish real-person pronunciation from synthesized voice recordings.
Step 105: When a test voice is received, perform living body detection on the test voice by the living body detection model to generate a prediction result.
When a test voice is received, the test voice may be real-person pronunciation, or it may be non-real-person pronunciation such as a recording. At this point, living body detection is performed on the test voice by the living body detection model; the detection combines the feature information of the test voice with the optimized parameters in the living body detection model, so a prediction result with higher accuracy can be generated. The prediction result may include two categories: real-person pronunciation and non-real-person pronunciation. The prediction result may also include three categories: real-person pronunciation, real-person recording, and synthesized voice.
Step 106: Determine, according to the prediction result, whether to perform voiceprint recognition on the test voice.
If the prediction result is real-person pronunciation, it is determined that voiceprint recognition is performed on the test voice to implement identity authentication of the user; or, if the prediction result is non-real-person pronunciation, it is determined that voiceprint recognition is not performed on the test voice.
For example, if the prediction result is real-person pronunciation, the test voice is input into the voiceprint recognition system for voiceprint recognition to implement identity authentication of the user, for example by matching the test voice against the preset user's voiceprint template stored in the voiceprint recognition library; if the match succeeds, the identity authentication passes, and if the match fails, the identity authentication does not pass. When the identity authentication passes, the electronic device is unlocked or woken up; when it does not pass, an authentication-failure prompt may be given and the locked state maintained, or the wake-up operation is not responded to.
If the prediction result is non-real-person pronunciation, authentication of the test voice is prohibited, and a voice prompt or text prompt may be issued to remind the user that the test voice is non-real-person pronunciation and that there may be a security risk. For example, when the prediction result is non-real-person pronunciation, a prompt message may also be sent to other user equipment or a user mailbox bound to the current device, to notify the user that the current device is currently being illegally authenticated by someone else. For example, when the prediction result is non-real-person pronunciation, the current device may also enter a self-protection mode. The self-protection mode may include changing the unlocking manner, for example from voiceprint unlocking to a combination of voiceprint recognition and face recognition, to increase the difficulty of unlocking; starting an automatic shutdown function; or automatically hiding private information in the current device, for example hiding folders marked as private, hiding applications with payment or financial management functions, or hiding instant messaging applications, which better protects the user's information security.
It should be noted that the training process of the reference model may be performed on the server side or on the electronic device side. When both the training process and the actual prediction process of the reference model are completed on the server side, and the living body detection model generated from the optimized reference model needs to be used, the test voice and the feature information corresponding to the test voice can be sent to the server; after the server completes the actual prediction, it sends the prediction result to the electronic device, and the electronic device decides, according to the prediction result, whether to proceed to the next step of identity authentication.
When both the training process and the actual prediction process of the reference model are completed on the electronic device side, and the living body detection model generated from the optimized reference model needs to be used, the test voice and the corresponding feature information can be input to the electronic device; after the electronic device completes the actual prediction, it decides, according to the prediction result, whether to proceed to the next step of identity authentication.
When the training process of the reference model is completed on the server side and the actual prediction process is completed on the electronic device side, and the living body detection model generated from the optimized reference model needs to be used, the test voice and the corresponding feature information can be input to the electronic device; after the electronic device completes the actual prediction, it decides, according to the prediction result, whether to proceed to the next step of identity authentication. Optionally, the trained living body detection model file (model file) can be ported to the smart device; if living body detection needs to be performed on an input test voice, the test voice is input to the trained living body detection model file (model file), and the prediction result is obtained by computation.
All of the above technical solutions may be combined arbitrarily to form optional embodiments of the present application, which are not described in detail here.
In the embodiments of the present application, a training sample is acquired, the training sample including a real human pronunciation sample and a non-real human pronunciation sample; feature information is extracted from the training sample; the training sample and the feature information are input as training data into a reference model for training to obtain optimized parameters of the trained reference model; a living body detection model is generated according to the optimized parameters; when a test voice is received, living body detection is performed on the test voice by the living body detection model to generate a prediction result; and whether to perform voiceprint recognition on the test voice is determined according to the prediction result. By performing model training with labeled real human pronunciation samples and non-real human pronunciation samples, and by integrating the optimized parameters obtained from the model into the voiceprint recognition system for voiceprint recognition, the embodiments of the present application can accurately recognize real-person pronunciation, prevent others from breaching security with recordings or synthesized voices, and improve the security of the device.
本发明实施例提供一种语音识别装置,包括:
获取模块,用于获取训练样本,所述训练样本包括真人发音样本和非真人发音样本;
提取模块,用于提取所述训练样本中的特征信息;
训练模块,用于将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数;
生成模块,用于根据所述优化参数生成活体检测模型;
检测模块,用于当接收到测试语音时,通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果;
识别模块,用于根据所述预测结果确定是否对所述测试语音进行声纹识别。
在一种实施方式中,该识别模块,用于:若所述预测结果为真人发音,则确定对所述测试语音进行声纹识别以实现用户的身份认证;或者若所述预测结果为非真人发音,则确定不对所述测试语音进行声纹识别。
在一种实施方式中,该获取模块,可以包括:第一采集子模块,用于采集真人发音,并标记为所述真人发音样本;第二采集子模块,用于采集非真人发音,并标记为所述非真人发音样本。
在一种实施方式中,该第二采集子模块,用于:对所述真人发音样本进行录音采集,并标记为所述非真人发音样本中的真人录音子样本;对合成人声发音进行录音采集,并标记为所述非真人发音样本中的合成人声录音子样本。
在一种实施方式中,该提取模块,用于:分别提取所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本对应的声谱图。该训练模块,用于将所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本分别对应的声谱图作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
在一种实施方式中,该训练模块,还用于:获取所述真人发音样本与所述真人录音子样本之间的差异特征值,以得到第一优化参数;以及获取所述真人发音样本与所述合成人声录音子样本之间的差异特征值,以得到第二优化参数。该生成模块,还用于根据所述第一优化参数与所述第二优化参数生成活体检测模型。
本申请实施例还提供一种语音识别装置,如图5至图6所示,图5至图6均为本申请实施例提供的一种语音识别装置的结构示意图。所述语音识别装置30包括获取模块31,提取模块32,训练模块33,生成模块34,检测模块35以及识别模块36。
其中,所述获取模块31,用于获取训练样本,所述训练样本包括真人发音样本和非真人发音样本。
在一些实施例中,如图6所示,所述获取模块31还包括第一采集子模块311和第二采集子模块312。
其中,所述第一采集子模块311,用于采集真人发音,并标记为所述真人发音样本;
所述第二采集子模块312,用于采集非真人发音,并标记为所述非真人发音样本。
所述第二采集子模块312,用于对所述真人发音样本进行录音采集,并标记为所述非真人发音样本中的真人录音子样本;对合成人声发音进行录音采集,并标记为所述非真人发音样本中的合成人声录音子样本。
所述提取模块32,用于提取所述训练样本中的特征信息。
在一些实施例中,所述提取模块32,用于分别提取所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本对应的声谱图。
所述训练模块33,用于将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
在一些实施例中,所述训练模块33,用于将所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本分别对应的声谱图作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
在一些实施例中,所述训练模块33,还用于获取所述真人发音样本与所述真人录音子样本之间的差异特征值,以得到第一优化参数;以及获取所述真人发音样本与所述合成人声录音子样本之间的差异特征值,以得到第二优化参数。
所述生成模块34,用于根据所述优化参数生成活体检测模型。
在一些实施例中,所述生成模块34,还用于根据所述第一优化参数与所述第二优化参数生成活体检测模型。
所述检测模块35,用于当接收到测试语音时,通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果。
所述识别模块36,用于根据所述预测结果确定是否对所述测试语音进行声纹识别。
其中,所述识别模块36,用于若所述预测结果为真人发音,则确定对所述测试语音进行声纹识别以实现用户的身份认证;或者若所述预测结果为非真人发音,则确定不对所述测试语音进行声纹识别。
上述所有的技术方案,可以采用任意结合形成本申请的可选实施例,在此不再一一赘述。
本申请实施例通过获取模块31获取训练样本,所述训练样本包括真人发音样本和非真人发音样本,提取模块32提取所述训练样本中的特征信息,训练模块33将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数,生成模块34根据所述优化参数生成活体检测模型,当接收到测试语音时,检测模块35通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果,识别模块36根据所述预测结果确定是否对所述测试语音进行声纹识别。本申请实施例的语音识别装置30通过利用标记好的真人发音样本和非真人发音样本进行模型训练,并根据当前模型得到的优化参数再融入到声纹识别系统中进行声纹识别,能够准确识别出真人发音,以防止他人利用录音或者人声合成进行安全攻破,提升设备的安全性。
本申请实施例还提供一种电子设备,包括存储器,处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器调用所述存储器中存储的所述计算机程序,执行本申请任一实施例所述的语音识别方法。
该电子设备可以是智能手机、平板电脑、掌上电脑等设备。如图7所示,电子设备400包括有一个或者一个以上处理核心的处理器401、有一个或一个以上计算机可读存储介质的存储器402及存储在存储器上并可在处理器上运行的计算机程序。其中,处理器401与存储器402电性连接。本领域技术人员可以理解,图中示出的电子设备结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
处理器401是电子设备400的控制中心,利用各种接口和线路连接整个电子设备的各个部分,通过运行或加载存储在存储器402内的应用程序,以及调用存储在存储器402内的数据,执行电子设备的各种功能和处理数据,从而对电子设备进行整体监控。
在本申请实施例中，电子设备400中的处理器401会按照如下的步骤，将一个或一个以上的应用程序的进程对应的指令加载到存储器402中，并由处理器401来运行存储在存储器402中的应用程序，从而实现各种功能：
获取训练样本,所述训练样本包括真人发音样本和非真人发音样本;
提取所述训练样本中的特征信息;
将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数;
根据所述优化参数生成活体检测模型;
当接收到测试语音时,通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果;
根据所述预测结果确定是否对所述测试语音进行声纹识别。
在一些实施例中,处理器401用于所述根据所述预测结果确定是否对所述测试语音进行声纹识别,包括:
若所述预测结果为真人发音,则确定对所述测试语音进行声纹识别以实现用户的身份认证;或者
若所述预测结果为非真人发音,则确定不对所述测试语音进行声纹识别。
在一些实施例中,处理器401用于所述获取训练样本,包括:
采集真人发音,并标记为所述真人发音样本;
采集非真人发音,并标记为所述非真人发音样本,其中所述非真人发音样本包括真人录音子样本与合成人声录音子样本。
在一些实施例中,处理器401用于所述采集非真人发音,并标记为所述非真人发音样本,包括:
对所述真人发音样本进行录音采集,并标记为所述非真人发音样本中的真人录音子样本;
对合成人声发音进行录音采集,并标记为所述非真人发音样本中的合成人声录音子样本。
在一些实施例中,处理器401用于所述提取所述训练样本中的特征信息,包括:
分别提取所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本对应的声谱图;
所述将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数,包括:
将所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本分别对应的声谱图作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
在一些实施例中,处理器401用于所述得到训练后的所述参考模型的优化参数,包括:
获取所述真人发音样本与所述真人录音子样本之间的差异特征值，以得到第一优化参数，以及获取所述真人发音样本与所述合成人声录音子样本之间的差异特征值，以得到第二优化参数；
所述根据所述优化参数生成活体检测模型,包括:根据所述第一优化参数与所述第二优化参数生成活体检测模型。
在一些实施例中,处理器401用于所述得到训练后的所述参考模型的优化参数,包括:
将所述训练样本中的真人发音样本、真人录音子样本与合成人声录音子样本对应的声谱图作为训练数据输入卷积层得到第一中间值;
将第一中间值输入全连接层得到第二中间值;
将第二中间值输入分类器得到对应多个预测结果的概率;
根据多个预测结果和与其对应的多个概率得到损失值;
根据损失值进行训练,得到优化参数。
在一些实施例中,如图8所示,电子设备400还包括:显示屏403、麦克风404、音频电路405、输入单元406以及射频电路407。其中,处理器401分别与显示屏403、麦克风404、音频电路405、输入单元406以及射频电路407电性连接。本领域技术人员可以理解,图8中示出的电子设备结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
显示屏403可用于显示由用户输入的信息或提供给用户的信息以及电子设备的各种图形用户接口,这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。显示屏403为触控显示屏时,也可以作为输入单元的一部分实现输入功能。
麦克风404可以用于将声音信号转换为电信号,以实现声音信号的录制或输入等。比如,可以通过麦克风404录制用户的测试语音等。
音频电路405可以用于通过扬声器、传声器提供用户与电子设备之间的音频接口。
输入单元406可用于接收输入的数字、字符信息或用户特征信息(例如指纹),以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。
The radio frequency circuit 407 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or other electronic devices and to exchange signals with the network device or other electronic devices.
尽管图8中未示出,电子设备400还可以包括摄像头、传感器、无线保真模块、蓝牙模块、电源等,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
本申请实施例中,所述语音识别装置与上文实施例中的一种语音识别方法属于同一构思,在所述语音识别装置上可以运行所述语音识别方法实施例中提供的任一方法,其具体实现过程详见所述语音识别方法实施例,此处不再赘述。
本申请实施例还提供一种存储介质,所述存储介质存储有计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行上述任一实施例中的语音识别方法。
It should be noted that, for the voice recognition method described in this application, a person of ordinary skill in the art can understand that all or part of the flow for implementing the voice recognition method described in the embodiments of this application may be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and be executed by at least one processor in the electronic device, and the execution process may include the flow of the embodiments of the voice recognition method. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), or the like.
对本申请实施例的所述语音识别装置而言,其各功能模块可以集成在一个处理芯片中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。所述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读存储介质中,所述存储介质譬如为只读存储器,磁盘或光盘等。
以上对本申请实施例所提供的一种语音识别方法、装置、存储介质及电子设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的技术方案及其核心思想;本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例的技术方案的范围。

Claims (20)

  1. 一种语音识别方法,应用于电子设备中,其中,所述方法包括:
    获取训练样本,所述训练样本包括真人发音样本和非真人发音样本;
    提取所述训练样本中的特征信息;
    将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数;
    根据所述优化参数生成活体检测模型;
    当接收到测试语音时,通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果;
    根据所述预测结果确定是否对所述测试语音进行声纹识别。
  2. 如权利要求1所述的语音识别方法,其中,所述根据所述预测结果确定是否对所述测试语音进行声纹识别,包括:
    若所述预测结果为真人发音,则确定对所述测试语音进行声纹识别以实现用户的身份认证;或者
    若所述预测结果为非真人发音,则确定不对所述测试语音进行声纹识别。
  3. 如权利要求1所述的语音识别方法,其中,所述获取训练样本,包括:
    采集真人发音,并标记为所述真人发音样本;
    采集非真人发音,并标记为所述非真人发音样本,其中所述非真人发音样本包括真人录音子样本与合成人声录音子样本。
  4. 如权利要求3所述的语音识别方法,其中,所述采集非真人发音,并标记为所述非真人发音样本,包括:
    对所述真人发音样本进行录音采集,并标记为所述非真人发音样本中的真人录音子样本;
    对合成人声发音进行录音采集,并标记为所述非真人发音样本中的合成人声录音子样本。
  5. 如权利要求4所述的语音识别方法,其中,所述提取所述训练样本中的特征信息,包括:
    分别提取所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本对应的声谱图;
    所述将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数,包括:
    将所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本分别对应的声谱图作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
  6. 如权利要求5所述的语音识别方法，其中，所述得到训练后的所述参考模型的优化参数，包括：
    获取所述真人发音样本与所述真人录音子样本之间的差异特征值,以得到第一优化参数,以及获取所述真人发音样本与所述合成人声录音子样本之间的差异特征值,以得到第二优化参数;
    所述根据所述优化参数生成活体检测模型,包括:根据所述第一优化参数与所述第二优化参数生成活体检测模型。
  7. 如权利要求1所述的语音识别方法,其中,所述得到训练后的所述参考模型的优化参数,包括:
    将所述训练样本中的真人发音样本、真人录音子样本与合成人声录音子样本对应的声谱图作为训练数据输入卷积层得到第一中间值;
    将第一中间值输入全连接层得到第二中间值;
    将第二中间值输入分类器得到对应多个预测结果的概率;
    根据多个预测结果和与其对应的多个概率得到损失值;
    根据损失值进行训练,得到优化参数。
  8. 一种语音识别装置,其中,所述装置包括:
    获取模块,用于获取训练样本,所述训练样本包括真人发音样本和非真人发音样本;
    提取模块,用于提取所述训练样本中的特征信息;
    训练模块,用于将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数;
    生成模块,用于根据所述优化参数生成活体检测模型;
    检测模块,用于当接收到测试语音时,通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果;
    识别模块,用于根据所述预测结果确定是否对所述测试语音进行声纹识别。
  9. 如权利要求8所述的语音识别装置,其中,所述识别模块,用于:
    若所述预测结果为真人发音,则确定对所述测试语音进行声纹识别以实现用户的身份认证;或者
    若所述预测结果为非真人发音,则确定不对所述测试语音进行声纹识别。
  10. 如权利要求8所述的语音识别装置,其中,所述获取模块还包括:
    第一采集子模块,用于采集真人发音,并标记为所述真人发音样本;
    第二采集子模块,用于采集非真人发音,并标记为所述非真人发音样本。
  11. 如权利要求10所述的语音识别装置,其中,第二采集子模块,用于:
    对所述真人发音样本进行录音采集,并标记为所述非真人发音样本中的真人录音子样本;
    对合成人声发音进行录音采集，并标记为所述非真人发音样本中的合成人声录音子样本。
  12. 如权利要求11所述的语音识别装置,其中,所述提取模块,用于分别提取所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本对应的声谱图;
    所述训练模块,用于将所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本分别对应的声谱图作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
  13. 如权利要求12所述的语音识别装置,其中,所述训练模块,还用于获取所述真人发音样本与所述真人录音子样本之间的差异特征值,以得到第一优化参数;以及获取所述真人发音样本与所述合成人声录音子样本之间的差异特征值,以得到第二优化参数;
    所述生成模块,还用于根据所述第一优化参数与所述第二优化参数生成活体检测模型。
  14. 一种存储介质,其上存储有计算机程序,其中,当所述计算机程序在计算机上运行时,使得所述计算机执行如权利要求1所述的语音识别方法。
  15. 一种电子设备,包括存储器和处理器,其中,所述处理器通过调用所述存储器中存储的计算机程序,用于执行步骤:
    获取训练样本,所述训练样本包括真人发音样本和非真人发音样本;
    提取所述训练样本中的特征信息;
    将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数;
    根据所述优化参数生成活体检测模型;
    当接收到测试语音时,通过所述活体检测模型对所述测试语音进行活体检测,以生成预测结果;
    根据所述预测结果确定是否对所述测试语音进行声纹识别。
  16. 如权利要求15所述的电子设备,其中,所述处理器通过调用所述计算机程序,用于执行步骤:
    若所述预测结果为真人发音,则确定对所述测试语音进行声纹识别以实现用户的身份认证;或者
    若所述预测结果为非真人发音,则确定不对所述测试语音进行声纹识别。
  17. 如权利要求15所述的电子设备,其中,所述处理器通过调用所述计算机程序,用于执行步骤:
    采集真人发音,并标记为所述真人发音样本;
    采集非真人发音,并标记为所述非真人发音样本,其中所述非真人发音样本包括真人录音子样本与合成人声录音子样本。
  18. 如权利要求17所述的电子设备,其中,所述处理器通过调用所述计算机程序,用于执行步骤:
    对所述真人发音样本进行录音采集,并标记为所述非真人发音样本中的真人录音子样本;
    对合成人声发音进行录音采集,并标记为所述非真人发音样本中的合成人声录音子样本。
  19. 如权利要求18所述的电子设备,其中,所述处理器通过调用所述计算机程序,用于执行步骤:
    分别提取所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本对应的声谱图;
    所述将所述训练样本以及所述特征信息作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数,包括:
    将所述真人发音样本、所述真人录音子样本与所述合成人声录音子样本分别对应的声谱图作为训练数据输入参考模型中进行训练,以得到训练后的所述参考模型的优化参数。
  20. 如权利要求19所述的电子设备,其中,所述处理器通过调用所述计算机程序,用于执行步骤:
    获取所述真人发音样本与所述真人录音子样本之间的差异特征值,以得到第一优化参数,以及获取所述真人发音样本与所述合成人声录音子样本之间的差异特征值,以得到第二优化参数;
    所述根据所述优化参数生成活体检测模型,包括:根据所述第一优化参数与所述第二优化参数生成活体检测模型。
PCT/CN2019/084131 2018-05-02 2019-04-24 语音识别方法、装置、存储介质及电子设备 WO2019210796A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810411000.9A CN110459204A (zh) 2018-05-02 2018-05-02 语音识别方法、装置、存储介质及电子设备
CN201810411000.9 2018-05-02

Publications (1)

Publication Number Publication Date
WO2019210796A1 true WO2019210796A1 (zh) 2019-11-07

Family

ID=68387027

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/084131 WO2019210796A1 (zh) 2018-05-02 2019-04-24 语音识别方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN110459204A (zh)
WO (1) WO2019210796A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111464519A (zh) * 2020-03-26 2020-07-28 支付宝(杭州)信息技术有限公司 基于语音交互的账号注册的方法和系统

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081259B (zh) * 2019-12-18 2022-04-15 思必驰科技股份有限公司 基于说话人扩充的语音识别模型训练方法及系统
CN111147965A (zh) * 2019-12-24 2020-05-12 深圳市康米索数码科技有限公司 一种基于物联网的可语音操控的蓝牙音箱系统
CN111667818B (zh) * 2020-05-27 2023-10-10 北京声智科技有限公司 一种训练唤醒模型的方法及装置
CN111785303B (zh) * 2020-06-30 2024-04-16 合肥讯飞数码科技有限公司 模型训练方法、模仿音检测方法、装置、设备及存储介质
CN112687295A (zh) * 2020-12-22 2021-04-20 联想(北京)有限公司 一种输入控制方法及电子设备
CN112634859B (zh) * 2020-12-28 2022-05-03 思必驰科技股份有限公司 用于文本相关说话人识别的数据增强方法及系统
CN112735381B (zh) * 2020-12-29 2022-09-27 四川虹微技术有限公司 一种模型更新方法及装置
CN113035230B (zh) * 2021-03-12 2022-12-27 北京百度网讯科技有限公司 认证模型的训练方法、装置及电子设备
CN113593581B (zh) * 2021-07-12 2024-04-19 西安讯飞超脑信息科技有限公司 声纹判别方法、装置、计算机设备和存储介质
CN114006747A (zh) * 2021-10-28 2022-02-01 平安普惠企业管理有限公司 交互安全管理方法、装置、计算机设备及可读存储介质
CN114419740A (zh) * 2022-01-11 2022-04-29 平安普惠企业管理有限公司 基于人工智能的活体检测方法、装置、设备及存储介质
CN116959438A (zh) * 2022-04-18 2023-10-27 华为技术有限公司 唤醒设备的方法、电子设备和存储介质
CN115022087B (zh) * 2022-07-20 2024-02-27 中国工商银行股份有限公司 一种语音识别验证处理方法及装置
CN115188109A (zh) * 2022-07-26 2022-10-14 思必驰科技股份有限公司 设备音频解锁方法、电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1808567A (zh) * 2006-01-26 2006-07-26 覃文华 验证真人在场状态的声纹认证设备和其认证方法
CN105139857A (zh) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种自动说话人识别中针对语音欺骗的对抗方法
WO2016003299A1 (en) * 2014-07-04 2016-01-07 Intel Corporation Replay attack detection in automatic speaker verification systems
CN106297772A (zh) * 2016-08-24 2017-01-04 武汉大学 基于扬声器引入的语音信号失真特性的回放攻检测方法
GB2541466A (en) * 2015-08-21 2017-02-22 Validsoft Uk Ltd Replay attack detection

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103943111A (zh) * 2014-04-25 2014-07-23 海信集团有限公司 一种身份识别的方法及装置
CN104468522B (zh) * 2014-11-07 2017-10-03 百度在线网络技术(北京)有限公司 一种声纹验证方法和装置
CN104680375A (zh) * 2015-02-28 2015-06-03 优化科技(苏州)有限公司 电子支付真人活体身份验证系统
JP2017085445A (ja) * 2015-10-30 2017-05-18 オリンパス株式会社 音声入力装置
CN106531172B (zh) * 2016-11-23 2019-06-14 湖北大学 基于环境噪声变化检测的说话人语音回放鉴别方法及系统
CN107729078B (zh) * 2017-09-30 2019-12-03 Oppo广东移动通信有限公司 后台应用程序管控方法、装置、存储介质及电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1808567A (zh) * 2006-01-26 2006-07-26 覃文华 验证真人在场状态的声纹认证设备和其认证方法
WO2016003299A1 (en) * 2014-07-04 2016-01-07 Intel Corporation Replay attack detection in automatic speaker verification systems
GB2541466A (en) * 2015-08-21 2017-02-22 Validsoft Uk Ltd Replay attack detection
CN105139857A (zh) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种自动说话人识别中针对语音欺骗的对抗方法
CN106297772A (zh) * 2016-08-24 2017-01-04 武汉大学 基于扬声器引入的语音信号失真特性的回放攻检测方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHUNLEI ZHANG: "An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing", IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, vol. 11, no. 4, 16 January 2017 (2017-01-16), pages 684 - 694, XP011649474 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111464519A (zh) * 2020-03-26 2020-07-28 支付宝(杭州)信息技术有限公司 基于语音交互的账号注册的方法和系统

Also Published As

Publication number Publication date
CN110459204A (zh) 2019-11-15

Similar Documents

Publication Publication Date Title
WO2019210796A1 (zh) 语音识别方法、装置、存储介质及电子设备
CN109726624B (zh) 身份认证方法、终端设备和计算机可读存储介质
CN106251874B (zh) 一种语音门禁和安静环境监控方法及系统
Ren et al. Sound-event classification using robust texture features for robot hearing
CN103475490B (zh) 一种身份验证方法及装置
CN108305615A (zh) 一种对象识别方法及其设备、存储介质、终端
CN105940407A (zh) 用于评估音频口令的强度的系统和方法
CN109448759A (zh) 一种基于气爆音的抗语音认证欺骗攻击检测方法
CN113330511B (zh) 语音识别方法、装置、存储介质及电子设备
CN112233698A (zh) 人物情绪识别方法、装置、终端设备及存储介质
CN104965589A (zh) 一种基于人脑智慧和人机交互的人体活体检测方法与装置
CN112507311A (zh) 一种基于多模态特征融合的高安全性身份验证方法
CN113327620A (zh) 声纹识别的方法和装置
CN105138886B (zh) 机器人生物体征识别系统
CN116883900A (zh) 一种基于多维生物特征的视频真伪鉴别方法和系统
CN113470653A (zh) 声纹识别的方法、电子设备和系统
CN116486789A (zh) 语音识别模型的生成方法、语音识别方法、装置及设备
CN114003883A (zh) 一种便携式的数字化身份验证设备及身份验证方法
Memon Multi-layered multimodal biometric authentication for smartphone devices
CN115222966A (zh) 对抗数据生成方法、装置、计算机设备及存储介质
Kita et al. Personal Identification with Face and Voice Features Extracted through Kinect Sensor
Hari et al. Comprehensive Research on Speaker Recognition and its Challenges
JP7287442B2 (ja) 情報処理装置、制御方法、及びプログラム
Duraibi A Secure Lightweight Voice Authentication System for IoT Smart Device Users
Wu et al. A Fingerprint and Voiceprint Fusion Identity Authentication Method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19796628

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19796628

Country of ref document: EP

Kind code of ref document: A1