WO2019140823A1 - Voice verification method and apparatus, computer device, and computer-readable storage medium - Google Patents

Voice verification method and apparatus, computer device, and computer-readable storage medium

Info

Publication number
WO2019140823A1
Authority
WO
WIPO (PCT)
Prior art keywords
verified
feature
text
information
voiceprint feature
Prior art date
Application number
PCT/CN2018/088696
Other languages
English (en)
French (fr)
Inventor
黄创茗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019140823A1 publication Critical patent/WO2019140823A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/08Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L21/0202
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02087Noise filtering the noise being separate speech, e.g. cocktail party
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Definitions

  • the present application relates to a voice verification method, apparatus, computer device, and computer readable storage medium.
  • At present, with the rapid advance of sensor technology, the accuracy, size, and price of sensor elements have improved greatly, so verifying a user's identity by recognizing biometric features can also be realized on a mobile terminal. Recognizing a user's voiceprint is a common verification method in conventional technology.
  • According to various embodiments disclosed in the present application, a voice verification method, apparatus, computer device, and storage medium are provided.
  • a voice verification method includes:
  • the feature model is retrained according to the voiceprint feature to be verified.
  • the feature model that matches the current scene type and corresponds to the user identifier is updated using the retrained feature model.
  • a voice verification device comprising:
  • An information acquiring module configured to obtain voice information to be verified and a corresponding user identifier
  • An information extraction module configured to extract, from the to-be-verified voice information, a voiceprint feature to be verified and a text to be verified;
  • a type obtaining module configured to acquire a current scene type
  • a model querying module configured to query a feature model that matches the current scene type and corresponds to the user identifier
  • a feature conversion module configured to convert the text to be verified into a reference voiceprint feature by using the feature model
  • a feature comparison module configured to compare the voiceprint feature to be verified with the reference voiceprint feature, and obtain a voice verification result
  • a retraining module configured to retrain the feature model according to the voiceprint feature to be verified when the verification result indicates that the verification is passed;
  • a model update module configured to update, by using the retrained feature model, a feature model that matches the current scene type and corresponds to the user identifier.
  • A computer device includes a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
  • the feature model is retrained according to the voiceprint feature to be verified.
  • the feature model that matches the current scene type and corresponds to the user identifier is updated using the retrained feature model.
  • One or more non-volatile storage media storing computer readable instructions are provided; when the computer readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the feature model is retrained according to the voiceprint feature to be verified.
  • the feature model that matches the current scene type and corresponds to the user identifier is updated using the retrained feature model.
  • FIG. 1 is an application scenario diagram of a voice verification method in accordance with one or more embodiments.
  • FIG. 2 is a flow diagram of a voice verification method in accordance with one or more embodiments.
  • FIG. 3 is a schematic flow chart of a voice verification method in another embodiment.
  • FIG. 4 is a block diagram of a voice verification device in accordance with one or more embodiments.
  • Figure 5 is a block diagram of a voice verification apparatus in another embodiment.
  • FIG. 6 is a block diagram of a voice verification device in accordance with one or more embodiments.
  • Figure 7 is a block diagram of a voice verification apparatus in another embodiment.
  • FIG. 8 is a block diagram of a voice verification device in accordance with one or more embodiments.
  • FIG. 9 is a block diagram of a computer device in accordance with one or more embodiments.
  • the voice verification method provided by the present application can be applied to an application environment as shown in FIG. 1.
  • The terminal 110 communicates with the server 120 via a network, and the user 100 operates the terminal 110 through an input device.
  • the terminal 110 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablets, and portable wearable devices, and the server 120 can be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
  • a voice verification method is provided.
  • The method is described as applied to the terminal in FIG. 1 by way of example, but it is not limited to being implemented only on the terminal. The method specifically includes the following steps:
  • the voice information to be verified is voice information verified in voice verification.
  • the user ID is the identifier of the user's identity.
  • In one embodiment, after the terminal collects the voice information to be verified, it sends the voice information to the server. After receiving the voice information to be verified, the server selects the user identifier corresponding to the terminal that sent it.
  • the voiceprint feature is characteristic information of the voiceprint.
  • Voiceprint is the sound spectrum of voice information.
  • a feature is information that describes the characteristics common to an object, and the object can be a voiceprint.
  • The feature may specifically be at least one of an MFCC (Mel Frequency Cepstral Coefficient) feature, a PLP (perceptual linear prediction) feature, and an LPC (Linear Predictive Coding) feature, or it may be at least one of spectrum, nasality, pronunciation, speech rate, and the like.
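  • As an illustration of the feature types mentioned above, the following is a minimal sketch of MFCC extraction. It assumes Python with the librosa library and a 16 kHz mono recording; neither the library nor the parameter values are specified by the patent.

```python
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 20) -> np.ndarray:
    # Load the recording as mono at 16 kHz (assumed sampling rate).
    signal, sr = librosa.load(path, sr=16000)
    # One MFCC vector per analysis frame; shape (frames, n_mfcc).
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T
```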
  • the voiceprint feature to be verified is the voiceprint feature that is verified in voice verification.
  • the text to be verified is the text information verified in the voice verification.
  • The text to be verified is specifically the voice information to be verified as recorded in text form.
  • the server extracts the voiceprint feature to be verified and the text to be verified from the voice information to be verified, and feeds the extracted voiceprint feature to be verified and the text to be verified back to the corresponding terminal.
  • the scene type is the type of scene.
  • the scene is specifically a combination of a place, a time, a weather, an environment, and the like when the voice information to be verified is acquired.
  • the current scene type is specifically the type of the scene when the voice information to be verified is obtained.
  • the terminal acquires location information and time information when the voice information to be verified is collected, and sends the obtained location information and time information to the server.
  • the server obtains corresponding weather information and environment information according to the received location information and time information, and determines the current scene type of the terminal according to the location information, the time information, the weather information, and the environment information.
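  • The scene-type decision described above can be pictured as a simple rule lookup. The sketch below is hypothetical: the rule table, place labels, and speed input are illustrative stand-ins for whatever the server actually uses.

```python
from datetime import datetime

def current_scene_type(ts: datetime, place: str, weather: str, speed_kmh: float) -> str:
    # Hypothetical rules mirroring the examples given later in the text.
    if place == "park" and speed_kmh > 6 and ts.hour < 9:
        return "outdoor jogging"
    if place == "home":
        return "home"
    if weather in ("rain", "sudden temperature drop"):
        return "prone to catching a cold"
    return "default"

print(current_scene_type(datetime(2018, 12, 18, 15), "office", "sudden temperature drop", 0.0))
```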
  • the feature model may specifically be a collection of voiceprint features of the user's individual, and the feature model may be used to simulate the voiceprint features of the user.
  • the server queries the database for a feature model that matches the current scene type and corresponds to the user identifier.
  • the reference voiceprint feature is a reference object of the voiceprint feature to be verified during voice verification.
  • the server converts the text to be verified into a voice information by a feature model, and extracts a reference voiceprint feature from the converted voice information.
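  • A minimal sketch of this conversion step follows. The `synthesize` and `extract` callables are hypothetical placeholders for the text-to-speech step driven by the feature model and for a feature extractor such as the MFCC sketch above; the patent does not name concrete APIs.

```python
from typing import Callable, Tuple
import numpy as np

def text_to_reference_voiceprint(
    text: str,
    synthesize: Callable[[str], Tuple[np.ndarray, int]],  # hypothetical TTS driven by the feature model
    extract: Callable[[np.ndarray, int], np.ndarray],      # e.g. an MFCC extractor
) -> np.ndarray:
    waveform, sr = synthesize(text)   # speech simulated with the user's feature model
    return extract(waveform, sr)      # reference voiceprint feature
```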
  • In one embodiment, after comparing the voiceprint feature to be verified with the reference voiceprint feature, the server feeds the resulting voice verification result back to the terminal. If the voice verification result indicates that the verification is passed, the terminal unlocks the corresponding application according to the result. If the result indicates that the verification fails, the terminal reacquires the voice information to be verified.
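  • The comparison step could, for example, use a similarity score against a threshold. The sketch below assumes cosine similarity over mean-pooled frame features and a 0.8 threshold; the patent does not fix a distance measure or threshold.

```python
import numpy as np

def verify(probe: np.ndarray, reference: np.ndarray, threshold: float = 0.8) -> bool:
    # Collapse per-frame features to one vector each, then score with cosine similarity.
    p, r = probe.mean(axis=0), reference.mean(axis=0)
    score = float(np.dot(p, r) / (np.linalg.norm(p) * np.linalg.norm(r) + 1e-9))
    return score >= threshold  # True => verification passed
```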
  • the feature model is retrained according to the voiceprint feature to be verified.
  • Retraining the feature model according to the voiceprint feature to be verified may specifically mean comparing the voiceprint feature to be verified with the feature model and adding voiceprint features that occur frequently in the voiceprint feature to be verified to the feature model.
  • In one embodiment, when the server detects that the voice verification result indicates that the verification is passed, it selects, from the voiceprint features to be verified, the voiceprint features whose frequency of occurrence is higher than a preset threshold and compares the selected features with the feature model. If the difference between a selected voiceprint feature and the corresponding voiceprint feature in the feature model is less than a preset value, the selected voiceprint feature is added to the feature model, as sketched below.
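  • A possible reading of this retraining rule: keep probe features that occur often enough, and merge a feature into the model only when it is close to the model's nearest existing feature. The count and distance thresholds below are illustrative assumptions.

```python
import numpy as np

def retrain_feature_model(model_feats: np.ndarray, probe_feats: np.ndarray,
                          counts: np.ndarray, min_count: int = 5,
                          max_distance: float = 0.5) -> np.ndarray:
    kept = []
    for feat, count in zip(probe_feats, counts):
        if count < min_count:          # only features that occur frequently enough
            continue
        nearest = np.linalg.norm(model_feats - feat, axis=1).min()
        if nearest < max_distance:     # close to the model's corresponding feature
            kept.append(feat)
    return np.vstack([model_feats, *kept]) if kept else model_feats
```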
  • In this embodiment, after the voice information to be verified and the corresponding user identifier are acquired, the voiceprint feature to be verified and the text to be verified are extracted from the voice information. The current scene type is acquired, and the feature model that matches the current scene type and corresponds to the user identifier is queried. Because the voice information to be verified is acquired in the scenario corresponding to the current scene type, the voice information to be verified matches the current scene type, and so does the voiceprint feature to be verified. The text to be verified is converted into a reference voiceprint feature through the feature model, so the reference voiceprint feature also matches the current scene type. When both the reference voiceprint feature and the voiceprint feature to be verified match the current scene type, the voice verification result obtained by comparing them accurately reflects whether the voice information to be verified is the user's own voice, so the user's own voice can still be recognized when the user's voice changes. Moreover, when the verification is passed, the feature model matching the current scene type and corresponding to the user identifier is retrained and updated with the voiceprint feature to be verified, which improves the validity of the feature model for that scene type and thus improves the recall rate of voice verification.
  • In one embodiment, obtaining the voice information to be verified and the corresponding user identifier includes: obtaining an identity verification instruction; acquiring the user identifier in response to the identity verification instruction; querying text pre-configured for the user identifier; randomly generating text when the pre-configured text is not found; feeding back the randomly generated text; and collecting voice information to be verified that matches the fed-back text.
  • the authentication command is an instruction to activate voice verification.
  • the pre-configured text is specifically text information corresponding to the voice information used to authenticate the user.
  • the text is randomly generated, and the text information may be randomly selected in the text list, or the text information may be randomly generated according to the dictionary.
  • In one embodiment, the terminal obtains an identity verification instruction triggered by the user through the touch screen, obtains the corresponding user identifier from the database in response to the instruction, and then queries the text pre-configured for that user identifier. When the pre-configured text is found, an indication that voice information is being collected is shown on the terminal's display. When the pre-configured text is not found, text is randomly generated according to a dictionary, the randomly generated text is shown on the display, and the voice information to be verified is collected.
  • In another embodiment, the terminal obtains an identity verification instruction triggered by the user through the touch screen and feeds the instruction back to the server.
  • the server obtains the corresponding user ID in the database and queries the pre-configured text corresponding to the user ID.
  • When the pre-configured text is found, an instruction to start collecting the voice information to be verified is fed back to the terminal.
  • When the pre-configured text is not found, text is randomly generated according to a dictionary and the randomly generated text is sent to the terminal.
  • In this embodiment, the text pre-configured for the user identifier is queried. If the pre-configured text is found, the voice information to be verified can be collected directly, which makes voice verification quick. If the pre-configured text is not found, randomly generating text also improves security.
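  • The fallback to a randomly generated prompt might look like the sketch below. The word list standing in for the "dictionary" and the four-word prompt length are assumptions, not values from the patent.

```python
import secrets

WORD_LIST = ["apple", "river", "seven", "orange", "window", "ninety", "cloud", "garden"]

def prompt_text(user_id: str, preconfigured: dict, n_words: int = 4) -> str:
    if user_id in preconfigured:
        return preconfigured[user_id]  # pre-configured text found
    return " ".join(secrets.choice(WORD_LIST) for _ in range(n_words))  # random prompt

print(prompt_text("user-42", {}))  # e.g. "cloud seven garden river"
```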
  • the extracting the voiceprint feature to be verified and the text to be verified from the to-be-verified voice information comprises: parsing the voice information to be verified to obtain a corresponding sound wave signal; and framing the sound wave signal to obtain Acoustic signal of each frame; performing Fourier transform on the acoustic signal of each frame to obtain a corresponding spectrum; extracting a single frame voiceprint feature from the spectrum; and generating the to-be-verified speech according to a single frame voiceprint feature of each frame The voiceprint feature of the information; the voiceprint feature is converted to the text to be verified.
  • the acoustic signal is information about the frequency and amplitude of the sound wave.
  • the acoustic signal is specifically based on the frequency of the sound as the ordinate and the time as the abscissa, reflecting the information of the frequency of the sound as a function of time. Framing is to set a number of consecutive time points as one frame.
  • Framing the acoustic signal specifically means dividing a complete acoustic signal, according to a preset frame length, into several acoustic signals whose time-axis span equals the frame length.
  • the Fourier transform is a formula that converts a time domain function into a frequency domain function.
  • the spectrum is information about the frequency distribution of the sound.
  • the spectrum is specifically the frequency of the sound as the abscissa, the amplitude of the frequency component and its phase are the ordinate, which represents the distribution of the magnitude of the sine wave of each frequency at a static time point.
  • Fourier transform is performed on the acoustic signal of each frame to obtain a corresponding spectrum, and specifically, a trigonometric function corresponding to the acoustic signal of each frame is converted into a spectrum in each frame time.
  • In one embodiment, the terminal parses the voice information to be verified to obtain the corresponding acoustic signal, frames the acoustic signal, multiplies each framed acoustic signal by a window function, and performs a Fourier transform on the result to obtain the corresponding spectrum. A single-frame voiceprint feature is extracted from the spectrum, and the voiceprint feature of the voice information to be verified is generated from the single-frame voiceprint features of all frames. The state of each frame of the acoustic signal is determined according to the state number corresponding to that frame's voiceprint feature, the determined states are combined to obtain the corresponding characters, and the text to be verified is generated from the obtained characters.
  • the window function is a function of truncating the acoustic signal.
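  • The framing, windowing, and Fourier-transform steps described above can be sketched as follows. The 25 ms frame length, 10 ms hop, and Hamming window are common defaults, not values prescribed by the patent; the input is assumed to be at least one frame long.

```python
import numpy as np

def frame_spectra(signal: np.ndarray, sr: int,
                  frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)                  # window function truncating each frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = [np.abs(np.fft.rfft(signal[i * hop:i * hop + frame_len] * window))
               for i in range(n_frames)]
    return np.array(spectra)                        # shape (frames, frame_len // 2 + 1)
```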
  • the method further includes: collecting current noise information; generating an anti-interference model according to the collected noise information; and after analyzing the obtained acoustic signal, correcting the parsed acoustic signal by the anti-interference model, and executing The step of dividing the acoustic signal into frames to obtain an acoustic signal for each frame.
  • the noise signal is a sound signal that interferes with the verification of the voice information.
  • the noise signal may specifically be at least one of a sound emitted by the surrounding environment, such as a wind sound, a rain sound, and a reading sound.
  • the anti-interference model is specifically a model for filtering noise signals in the acoustic signal to be verified.
  • the acoustic wave signal obtained by the analysis is corrected by the anti-interference model, and specifically, the anti-interference model may be superimposed on the parsed acoustic signal, or the anti-interference model may be filtered from the parsed acoustic signal.
  • In this embodiment, by collecting the current noise signal and generating an anti-interference model, the acoustic signal can be corrected according to the anti-interference model, so that the parsed acoustic signal is more accurate and the accuracy of voiceprint verification is improved.
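  • One possible reading of the correction step is simple spectral subtraction of a noise profile built from the collected noise information, sketched below. The patent only states that the parsed signal is corrected by the anti-interference model, so this concrete form is an assumption.

```python
import numpy as np

def denoise(speech_spectra: np.ndarray, noise_spectra: np.ndarray) -> np.ndarray:
    noise_profile = noise_spectra.mean(axis=0)  # average noise magnitude per frequency bin
    cleaned = speech_spectra - noise_profile    # subtract the interference estimate
    return np.clip(cleaned, 0.0, None)          # keep magnitudes non-negative
```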
  • the acquiring the current scene type includes: acquiring time information and/or geographic location information of the to-be-verified voice information; and querying a preset scene type that matches the time information and/or the geographic location information; The preset scene type that is queried is used as the current scene type.
  • the time information is the time at which the voice information to be verified is collected.
  • the time information specifically includes the date and the time point of the day, and the time points in the day include hours, minutes, and seconds.
  • the geographical location information is the geographic location where the voice information to be verified is collected.
  • the geographical location information specifically includes a city logo and a building identifier, and the building identifier may specifically be at least one of a sports field, a house, a hospital, a company, a subway station, and a road.
  • In one embodiment, the terminal obtains the intra-day time point at which the voice information to be verified is collected, for example 6:00 a.m., and then obtains its current geographic location information, for example Shenzhen Bay Park in Nanshan District, Shenzhen. If the terminal's sensors indicate that the terminal was moving at a steady speed of 8 km/h during the 30 minutes before the voice information was acquired, the preset scene type "outdoor jogging" is found, and the terminal uses "outdoor jogging" as the current scene type.
  • In one embodiment, the terminal obtains its current geographic location information, for example at home, directly selects the preset scene type "home", and uses "home" as the current scene type.
  • In one embodiment, the terminal detects that the connected WIFI (Wireless Fidelity, a wireless local area network based on the IEEE 802.11b standard) is a preset secure WIFI, directly selects the preset scene type "secure location", and uses "secure location" as the current scene type.
  • In this embodiment, by obtaining the time information and/or geographic location information at which the voice information to be verified is collected, querying the matching preset scene type, and using the queried preset scene type as the current scene type, the corresponding feature model can be selected. The scene type matched by the voice information to be verified is therefore consistent with the scene type matched by the feature model, which minimizes the influence of the scene on the voice information to be verified and improves the recall rate of voice verification.
  • In one embodiment, obtaining the current scene type includes: obtaining the time information and geographic location information at which the voice information to be verified is collected; looking up weather information matching the time information and the geographic location information; querying a preset scene type matching the weather information; and using the queried preset scene type as the current scene type.
  • Weather information is information about weather phenomena in a region.
  • The weather information specifically includes temperature, air pressure, humidity, wind, clouds, fog, rain, lightning, snow, frost, thunder, hail, haze, and the like.
  • the terminal obtains the date and daytime of the voice information to be verified, for example, at 3:00 pm on December 18, and then obtains the geographical location information of the terminal, such as Ping An Building, Futian District, Shenzhen.
  • Based on the obtained date and geographic location information, the current weather information is queried in a weather forecast system, for example cloudy, a current temperature of 12 degrees Celsius, a northeast wind of force 5, and a drop of 5 degrees Celsius compared with 3:00 p.m. on December 17; the preset scene type found is "prone to catching a cold".
  • The queried scene type "prone to catching a cold" is used as the current scene type.
  • In this embodiment, the time information and/or geographic location information at which the voice information to be verified is collected is obtained, the matching weather information is looked up, the preset scene type matching that weather information is queried, and the queried preset scene type is used as the current scene type. The corresponding feature model can thus be selected, so that the scene type matched by the voice information to be verified is consistent with the scene type matched by the feature model, which minimizes the influence of the scene on the voice information to be verified and improves the recall rate of voice verification.
  • In one embodiment, the method further includes: acquiring a common feature model; acquiring training speech samples corresponding to a preset scene type and the user identifier; and retraining the common feature model according to the training speech samples to obtain a feature model that matches the preset scene type and the user identifier.
  • the public feature model is a generic feature model.
  • the public feature model is specifically a feature model common to the same type of sound, such as male voice, child voice or female voice.
  • the training speech sample is the speech information collected by the training feature model. Specifically, the period of collecting the training speech samples is between one month and three months after the selection of the common feature model, and the specific time depends on the frequency of collecting the training speech samples.
  • In one embodiment, the server selects, from a model library, a GMM-UBM (Gaussian Mixture Model–Universal Background Model) matching the user's voiceprint and continuously trains the GMM-UBM with the training speech samples collected during the training period, so that the GMM-UBM becomes a feature model matching the user's user identifier.
  • When, while training the GMM-UBM, the server detects that the voiceprint features of the training speech samples differ greatly from the voiceprint features collected at other times, it acquires scene information of the terminal such as geographic location information, time information, and weather information, and labels the acquired scene information as a scene type.
  • In this embodiment, by retraining the common feature model with the training speech samples, a feature model can be trained quickly, which improves efficiency.
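  • A rough sketch of adapting a common model to a user's training samples follows. It uses scikit-learn's GaussianMixture warm-started from the common model's parameters as a stand-in for GMM-UBM adaptation, which is a simplification of the training described above, not the claimed procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_common_model(common: GaussianMixture, train_feats: np.ndarray) -> GaussianMixture:
    # Start a user-specific mixture from the common model's weights and means.
    adapted = GaussianMixture(n_components=common.n_components,
                              covariance_type=common.covariance_type,
                              weights_init=common.weights_,
                              means_init=common.means_,
                              max_iter=20)
    adapted.fit(train_feats)  # refine the common model on the user's speech samples
    return adapted
```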
  • a voice verification method is further provided, and the method specifically includes the following steps:
  • the terminal acquires an identity verification instruction.
  • the terminal acquires a user identifier in response to the identity verification instruction.
  • the terminal queries the text corresponding to the pre-configuration of the user identifier.
  • S312 The terminal collects current noise information.
  • the terminal collects the to-be-verified voice information that matches the returned text.
  • the terminal feeds back the collected noise information and the voice information to be verified to the server.
  • the server generates an anti-interference model according to the noise information.
  • the server parses the to-be-verified voice information to obtain a corresponding sound wave signal.
  • After parsing out the acoustic signal, the server corrects the parsed acoustic signal using the anti-interference model.
  • the server divides the sound wave signal into frames to obtain an acoustic wave signal of each frame.
  • the server performs Fourier transform on the acoustic signal of each frame to obtain a corresponding spectrum.
  • the server extracts a single frame voiceprint feature from the spectrum.
  • the server generates a voiceprint feature of the voice information to be verified according to a single frame voiceprint feature of each frame.
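  • Combining the single-frame features into one utterance-level voiceprint could, for instance, use mean and standard deviation pooling, as in the sketch below; the patent does not prescribe the aggregation method.

```python
import numpy as np

def utterance_voiceprint(frame_features: np.ndarray) -> np.ndarray:
    # frame_features: (frames, feature_dim); returns a single fixed-length vector.
    return np.concatenate([frame_features.mean(axis=0), frame_features.std(axis=0)])
```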
  • the server converts the voiceprint feature into text to be verified.
  • The terminal acquires the time information and geographic location information at which the voice information to be verified was collected and feeds them back to the server, and the server looks up weather information matching the time information and the geographic location information.
  • The server queries a preset scene type that matches the weather information.
  • The server uses the queried preset scene type as the current scene type.
  • the server queries a feature model that matches the current scene type and corresponds to the user identifier.
  • the server converts the text to be verified into a reference voiceprint feature by using the feature model.
  • the server compares the voiceprint feature to be verified and the reference voiceprint feature to obtain a voice verification result.
  • In the above voice verification method, after the voice information to be verified and the corresponding user identifier are acquired, the voiceprint feature to be verified and the text to be verified are extracted from the voice information. The current scene type is acquired, and the feature model that matches the current scene type and corresponds to the user identifier is queried. Because the voice information to be verified is acquired in the scenario corresponding to the current scene type, both the voice information and the voiceprint feature to be verified match the current scene type. The text to be verified is converted into a reference voiceprint feature through the feature model, so the reference voiceprint feature also matches the current scene type. When both features match the current scene type, comparing the voiceprint feature to be verified with the reference voiceprint feature yields a voice verification result that accurately reflects whether the voice information is the user's own voice, so the user's own voice can still be recognized when it changes. Moreover, when the verification is passed, the feature model matching the current scene type and corresponding to the user identifier is retrained and updated with the voiceprint feature to be verified, which improves the validity of the feature model for that scene type and thus the recall rate of voice verification.
  • As shown in FIG. 4, in one embodiment, a voice verification apparatus 400 is provided, including an information acquisition module 402, an information extraction module 404, a type acquisition module 406, a model query module 408, a feature conversion module 410, a feature comparison module 412, a retraining module 413, and a model update module 415. The information acquisition module 402 is configured to obtain voice information to be verified and a corresponding user identifier; the information extraction module 404 is configured to extract the voiceprint feature to be verified and the text to be verified from the voice information to be verified; the type acquisition module 406 is configured to acquire a current scene type; the model query module 408 is configured to query a feature model that matches the current scene type and corresponds to the user identifier; the feature conversion module 410 is configured to convert the text to be verified into a reference voiceprint feature through the feature model; the feature comparison module 412 is configured to compare the voiceprint feature to be verified with the reference voiceprint feature to obtain a voice verification result; the retraining module 413 is configured to retrain the feature model according to the voiceprint feature to be verified when the voice verification result indicates that the verification is passed; and the model update module 415 is configured to update, with the retrained feature model, the feature model that matches the current scene type and corresponds to the user identifier.
  • The above voice verification apparatus 400, after acquiring the voice information to be verified and the corresponding user identifier, extracts the voiceprint feature to be verified and the text to be verified from the voice information. The current scene type is acquired, and the feature model that matches the current scene type and corresponds to the user identifier is queried. Because the voice information to be verified is acquired in the scenario corresponding to the current scene type, both the voice information and the voiceprint feature to be verified match the current scene type. The text to be verified is converted into a reference voiceprint feature through the feature model, so the reference voiceprint feature also matches the current scene type. When both features match the current scene type, comparing the voiceprint feature to be verified with the reference voiceprint feature yields a voice verification result that accurately reflects whether the voice information is the user's own voice, so the user's own voice can still be recognized when it changes. Moreover, when the verification is passed, the feature model matching the current scene type and corresponding to the user identifier is retrained and updated with the voiceprint feature to be verified, which improves the validity of the feature model for that scene type and thus the recall rate of voice verification.
  • the information obtaining module 402 includes: an instruction obtaining module 402a, configured to obtain an identity verification instruction; and an identifier obtaining module 402b, configured to obtain a user identifier in response to the identity verification instruction;
  • the text query module 402c is configured to query the text corresponding to the user identifier pre-configuration;
  • the text generating module 402d is configured to randomly generate the text when the text is not queried;
  • the text feedback module 402e is configured to feed back the randomly generated text.
  • the information collection module 402f is configured to collect the to-be-verified voice information that matches the returned text.
  • As shown in FIG. 6, in one embodiment, the information extraction module 404 includes: an information parsing module 404a configured to parse the voice information to be verified to obtain the corresponding acoustic signal; a signal framing module 404b configured to frame the acoustic signal to obtain the acoustic signal of each frame; a signal transform module 404c configured to perform a Fourier transform on the acoustic signal of each frame to obtain the corresponding spectrum; a feature extraction module 404d configured to extract a single-frame voiceprint feature from the spectrum; a feature generation module 404e configured to generate the voiceprint feature of the voice information to be verified from the single-frame voiceprint features of all frames; and a text conversion module 404f configured to convert the voiceprint feature into the text to be verified.
  • In one embodiment, the information acquisition module 402 is further configured to collect current noise information, and the information extraction module 404 is further configured to generate an anti-interference model according to the collected noise information and, after the acoustic signal has been obtained by parsing, to correct the parsed acoustic signal through the anti-interference model and then perform the step of framing the acoustic signal to obtain the acoustic signal of each frame.
  • As shown in FIG. 7, in one embodiment, the type acquisition module 406 includes: a scene acquisition module 406a configured to acquire the time information and/or geographic location information at which the voice information to be verified is collected; a type query module 406b configured to query a preset scene type matching the time information and/or geographic location information; and a type determination module 406c configured to use the queried preset scene type as the current scene type.
  • the scene obtaining module 406a is further configured to acquire time information and geographic location information of the voice information to be verified.
  • The type acquisition module 406 further includes a weather acquisition module 406d configured to look up weather information matching the time information and the geographic location information; the type query module 406b is further configured to query a preset scene type matching the weather information; and the type determination module 406c is further configured to use the queried preset scene type as the current scene type.
  • As shown in FIG. 8, in one embodiment, the voice verification apparatus 400 further includes: a model acquisition module 414 configured to acquire a common feature model; a sample acquisition module 416 configured to acquire training speech samples corresponding to a preset scene type and the user identifier; and a model training module 418 configured to retrain the common feature model according to the training speech samples to obtain a feature model matching the preset scene type and the user identifier.
  • Each of the above-described voice verification devices may be implemented in whole or in part by software, hardware, and combinations thereof.
  • Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.
  • In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 9.
  • the computer device includes a processor, memory, network interface, display screen, and input device connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores operating systems and computer readable instructions.
  • the internal memory provides an environment for operation of an operating system and computer readable instructions in a non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions are executed by the processor to implement a voice verification method.
  • the display screen of the computer device may be a liquid crystal display or an electronic ink display screen
  • the input device of the computer device may be a touch layer covered on the display screen, or may be a button, a trackball or a touchpad provided on the computer device casing. It can also be an external keyboard, trackpad or mouse.
  • Those skilled in the art can understand that FIG. 9 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • In one embodiment, a computer device is provided, including a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to implement the steps of the voice verification method provided in any one of the embodiments of the present application.
  • The above computer device, after acquiring the voice information to be verified and the corresponding user identifier, extracts the voiceprint feature to be verified and the text to be verified from the voice information. The current scene type is acquired, and the feature model that matches the current scene type and corresponds to the user identifier is queried. Because the voice information to be verified is acquired in the scenario corresponding to the current scene type, both the voice information and the voiceprint feature to be verified match the current scene type. The text to be verified is converted into a reference voiceprint feature through the feature model, so the reference voiceprint feature also matches the current scene type. When both features match the current scene type, comparing the voiceprint feature to be verified with the reference voiceprint feature yields a voice verification result that accurately reflects whether the voice information is the user's own voice, so the user's own voice can still be recognized when it changes. Moreover, when the verification is passed, the feature model matching the current scene type and corresponding to the user identifier is retrained and updated with the voiceprint feature to be verified, which improves the validity of the feature model for that scene type and thus the recall rate of voice verification.
  • In one embodiment, one or more non-volatile storage media storing computer readable instructions are provided; when the computer readable instructions are executed by one or more processors, the one or more processors implement the steps of the voice verification method provided in any one of the embodiments of the present application.
  • The above computer-readable storage medium, after the voice information to be verified and the corresponding user identifier are acquired, causes the voiceprint feature to be verified and the text to be verified to be extracted from the voice information. The current scene type is acquired, and the feature model that matches the current scene type and corresponds to the user identifier is queried. Because the voice information to be verified is acquired in the scenario corresponding to the current scene type, both the voice information and the voiceprint feature to be verified match the current scene type. The text to be verified is converted into a reference voiceprint feature through the feature model, so the reference voiceprint feature also matches the current scene type. When both features match the current scene type, comparing the voiceprint feature to be verified with the reference voiceprint feature yields a voice verification result that accurately reflects whether the voice information is the user's own voice, so the user's own voice can still be recognized when it changes. Moreover, when the verification is passed, the feature model matching the current scene type and corresponding to the user identifier is retrained and updated with the voiceprint feature to be verified, which improves the validity of the feature model for that scene type and thus the recall rate of voice verification.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to an identity verification method, including: obtaining voice information to be verified and a corresponding user identifier; extracting a voiceprint feature to be verified and text to be verified from the voice information to be verified; obtaining a current scene type; querying a feature model that matches the current scene type and corresponds to the user identifier; converting the text to be verified into a reference voiceprint feature through the feature model; comparing the voiceprint feature to be verified with the reference voiceprint feature to obtain a voice verification result; when the voice verification result indicates that the verification is passed, retraining the feature model according to the voiceprint feature to be verified; and updating, with the retrained feature model, the feature model that matches the current scene type and corresponds to the user identifier.

Description

Voice verification method and apparatus, computer device, and computer-readable storage medium
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on January 16, 2018, with application number 2018100417643 and entitled "Voice verification method and apparatus, computer device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to a voice verification method and apparatus, a computer device, and a computer-readable storage medium.
BACKGROUND
Because no two people share the same biometric features, a user's identity can be confirmed accurately by recognizing the user's biometric features. Recognizing biometric features of the human body requires high-precision sensors, which are generally large.
At present, with the rapid advance of sensor technology, the accuracy, size, and price of sensor elements have improved greatly, so that verifying a user's identity by recognizing biometric features can also be realized on a mobile terminal. Recognizing a user's voiceprint is a common verification method in conventional technology.
However, the inventor realized that conventional voice verification methods succeed only when the user's voice remains unchanged; any change in the user's voice causes conventional voice verification to fail, so the recall rate of verification is very low.
SUMMARY
According to various embodiments disclosed in the present application, a voice verification method and apparatus, a computer device, and a storage medium are provided.
A voice verification method includes:
obtaining voice information to be verified and a corresponding user identifier;
extracting a voiceprint feature to be verified and text to be verified from the voice information to be verified;
obtaining a current scene type;
querying a feature model that matches the current scene type and corresponds to the user identifier;
converting the text to be verified into a reference voiceprint feature through the feature model;
comparing the voiceprint feature to be verified with the reference voiceprint feature to obtain a voice verification result;
when the voice verification result indicates that the verification is passed, retraining the feature model according to the voiceprint feature to be verified; and
updating, with the retrained feature model, the feature model that matches the current scene type and corresponds to the user identifier.
A voice verification apparatus includes:
an information acquisition module configured to obtain voice information to be verified and a corresponding user identifier;
an information extraction module configured to extract a voiceprint feature to be verified and text to be verified from the voice information to be verified;
a type acquisition module configured to obtain a current scene type;
a model query module configured to query a feature model that matches the current scene type and corresponds to the user identifier;
a feature conversion module configured to convert the text to be verified into a reference voiceprint feature through the feature model;
a feature comparison module configured to compare the voiceprint feature to be verified with the reference voiceprint feature to obtain a voice verification result;
a retraining module configured to retrain the feature model according to the voiceprint feature to be verified when the verification result indicates that the verification is passed; and
a model update module configured to update, with the retrained feature model, the feature model that matches the current scene type and corresponds to the user identifier.
A computer device includes a memory and one or more processors, the memory storing computer readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
obtaining voice information to be verified and a corresponding user identifier;
extracting a voiceprint feature to be verified and text to be verified from the voice information to be verified;
obtaining a current scene type;
querying a feature model that matches the current scene type and corresponds to the user identifier;
converting the text to be verified into a reference voiceprint feature through the feature model;
comparing the voiceprint feature to be verified with the reference voiceprint feature to obtain a voice verification result;
when the voice verification result indicates that the verification is passed, retraining the feature model according to the voiceprint feature to be verified; and
updating, with the retrained feature model, the feature model that matches the current scene type and corresponds to the user identifier.
One or more non-volatile storage media storing computer readable instructions are provided; when the computer readable instructions are executed by one or more processors, the one or more processors perform the following steps:
obtaining voice information to be verified and a corresponding user identifier;
extracting a voiceprint feature to be verified and text to be verified from the voice information to be verified;
obtaining a current scene type;
querying a feature model that matches the current scene type and corresponds to the user identifier;
converting the text to be verified into a reference voiceprint feature through the feature model;
comparing the voiceprint feature to be verified with the reference voiceprint feature to obtain a voice verification result;
when the voice verification result indicates that the verification is passed, retraining the feature model according to the voiceprint feature to be verified; and
updating, with the retrained feature model, the feature model that matches the current scene type and corresponds to the user identifier.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a diagram of an application scenario of a voice verification method according to one or more embodiments.
FIG. 2 is a schematic flowchart of a voice verification method according to one or more embodiments.
FIG. 3 is a schematic flowchart of a voice verification method in another embodiment.
FIG. 4 is a block diagram of a voice verification apparatus according to one or more embodiments.
FIG. 5 is a block diagram of a voice verification apparatus in another embodiment.
FIG. 6 is a block diagram of a voice verification apparatus according to one or more embodiments.
FIG. 7 is a block diagram of a voice verification apparatus in another embodiment.
FIG. 8 is a block diagram of a voice verification apparatus according to one or more embodiments.
FIG. 9 is a block diagram of a computer device according to one or more embodiments.
具体实施方式
为了使本申请的技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
本申请提供的语音验证方法,可以应用于如图1所示的应用环境中。终端110通过网络与服务器120通过网络进行通信,用户100通过输入装置操作终端110。终端110可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备,服务器120可以用独立的服务器或者是多个服务器组成的服务器集群来实现。
在其中一个实施例中,如图2所示,提供了一种语音验证方法,以该方法应用于图1中的终端为例进行说明,但该方法不限定于仅仅在终端上实施,具体包括以下步骤:
S202,获取待验证语音信息和相应的用户标识。
待验证语音信息是语音验证中被验证的语音信息。用户标识是用户身份的标识。
在其中一个实施例中,终端采集到待验证语音信息后,将该待验证语音信息发送至服务器。服务器接收到待验证语音信息后,选取与发送该待验证语音信息的终端相应的用户标识。
S204,从该待验证语音信息中提取待验证的声纹特征和待验证文本。
声纹特征是声纹的特征信息。声纹是语音信息的声波频谱。特征是描述客体共有的特性的信息,客体可以是声纹。特征具体可以是MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)特征、PLP(perceptual linear prediction,感知线性预测)特征和LPC(Linear Predictive Coding,线性预测分析)等中的至少一种,也可以是频谱、鼻音、发音和语速等中的至少一种。待验证的声纹特征是语音验证中被验证的声纹特征。待验证文本是语音验证中被验证的文本信息。待验证文本具体是待验证语音信息以文本形式记载的信息。
在其中一个实施例中,服务器从待验证语音信息中提取待验证的声纹特征和待验证文本,将提取出的待验证的声纹特征和待验证文本反馈回相应的终端。
S206,获取当前场景类型。
场景类型是场景的类型。场景具体是获取待验证语音信息时的地点、时间、天气和环境等的组合。当前场景类型具体是获取待验证语音信息时场景的类型。
在其中一个实施例中,终端获取采集待验证语音信息时的位置信息和时间信息,将获取的位置信息和时间信息发送至服务器。服务器根据接收到的位置信息和时间信息获取相应的天气信息和环境信息,并根据该位置信息、时间信息、天气信息和环境信息确定终端当前的场景类型。
S208,查询与该当前场景类型匹配、且与该用户标识对应的特征模型。
特征模型具体可以是用户个人的声纹特征的集合,特征模型可以用于模拟用户的声纹特征。
在其中一个实施例中,终端在将终端相应的当前场景类型和用户标识反馈至服务器后,服务器从数据库中查询与当前场景类型匹配、且与用户标识对应的特征模型。
S210,通过该特征模型,将该待验证文本转换为参考声纹特征。
参考声纹特征是语音验证时待验证声纹特征的参照对象。
在其中一个实施例中,服务器将待验证文本通过特征模型转换成语音信息,并从转换得到的语音信息中提取参考声纹特征。
S212,比较该待验证的声纹特征和该参考声纹特征,得到语音验证结果。
在其中一个实施例中,服务器比较待验证的声纹特征和参考声纹特征后,将得到的语音验证结果反馈回终端。若该语音验证结果表示验证通过时,则终端根据该语音验证结果将相应的应用程序解锁。若该语音验证结果表示验证未通过时,则终端重新获取待验证语音信息。
S214,当该语音验证结果表示验证通过时,则根据该待验证的声纹特征对该特征模型进行再训练。
根据待验证的声纹特征对特征模型进行再训练,具体可以是对比待验证的声纹特征和特征模型,将待验证的声纹特征中出现频率高的声纹特征加入特征模型中。
在其中一个实施例中,服务器在检测到语音验证结果表示验证通过时,则从待验证的声纹特征中选取出现频率高于预设阈值的声纹特征,将选取出的声纹特征与特征模型进行对比,若选取出的声纹特征与特征模型中相应的声纹特征相差小于预设值,则将选取出的声纹特征加入特征模型中。
S216,使用再训练后的特征模型更新与该当前场景类型和该用户标识匹配的特征模型。
本实施例中,在获取待验证语音信息和相应的用户标识后,从待验证语音信息中提取声纹特征和待验证文本。通过获取当前的场景类型,查询与当前场景类型匹配、且与用户标识对应的特征模型,由于待验证语音信息是在当前的场景类型相应的场景下获取到的,因此待验证语音信息与当前场景类型匹配,待验证的声纹特征也与当前场景类型匹配。通过特征模型将待验证文本转换为参考声纹特征,该参考声纹特征自然也与当前场景类型匹配。在参考声纹特征和待验证的声纹特征都与当前场景类型匹配时,通过比较该待验证的声纹特征和该参考声纹特征,得到的语音验证结果就可以准确地反映待验证语音信息是否是用户本人的语音信息,从而可以在用户声音发生变化时,也能够识别出用户本人的声音。而且在验证通过 时,使用待验证的声纹特征对与当前场景类型匹配、且与用户标识对应的特征模型再训练并更新,也可以提高与这个场景类型相应的特征模型的有效性,进而提高语音验证的召回率。
在其中一个实施例中,该获取待验证语音信息和相应的用户标识,包括:获取身份验证指令;响应于该身份验证指令,获取用户标识;查询对应于该用户标识预配置的文本;当未查询到该文本时,随机生成文本;反馈随机生成的该文本;采集与反馈的该文本相匹配的待验证语音信息。
身份验证指令是激活语音验证的指令。预配置的文本具体是用于认证用户身份的语音信息相应的文本信息。随机生成文本,具体可以是在文本列表中随机选取文本信息,也可以是根据字典随机生成文本信息。
在其中一个实施例中,终端获取用户通过触摸屏触发的身份验证指令,响应于该身份验证指令,在数据库中获取相应的用户标识,在获取用户标识后,查询对应于该用户标识的预配置的文本。当查询到预配置的文本时,在终端的显示屏上显示正在采集语音信息的标识。当未查询到预配置的文本时,根据字典随机生成文本,将随机生成的文本在显示屏上显示,并采集待验证语音信息。
在其中一个实施例中,终端获取用户通过触摸屏触发的身份验证指令,将该身份验证指令反馈值服务器。服务器在数据库中获取相应的用户标识,并查询对应于该用户标识的预配置的文本。当查询到预配置的文本时,向终端反馈开始采集待验证语音信息的指令。当未查询到预配置的文本时,根据字典随机生成文本,将随机生成的文本发送至终端。
本实施例中,通过获取用户标识,查询对应于用户标识的预配置的文本。如果查询到预配置的文本,就可以直接采集待验证语音信息,使得语音验证很快捷。如果未查询到预配置的文本,则随机生成文本,也能够提高安全性。
在其中一个实施例中,该从该待验证语音信息中提取待验证的声纹特征和待验证文本,包括:解析该待验证语音信息,得到相应的声波信号;将该声波信号分帧,得到每一帧的声波信号;对该每一帧的声波信号进行傅立叶变换,得到相应的频谱;从该频谱中提取单帧声纹特征;根据每一帧的单帧声纹特征生成该待验证语音信息的声纹特征;将该声纹特征转化为待验证文本。
声波信号是声波的频率和幅度变化的信息。声波信号具体是以声音的频率为纵坐标,以时间为横坐标,反映声音的频率随时间变化的信息。分帧是将连续的若干个时间点设为一帧。将声波信号分帧,具体可以是将声波信号按照预设的帧长,将一个完整的声波信号划分为若干个横坐标区间大小为帧长的声波信号。
傅立叶变换是将时域函数转换成频域函数的公式。频谱是声音的频率分布的信息。频谱具体是以声音的频率为横坐标,频率分量的振幅及其相位为纵坐标,表示的是一个静态时间点上各频率正弦波的幅值大小的分布状况。对每一帧的声波信号进行傅立叶变换,得到相应的频谱,具体可以是将每一帧的声波信号相应的三角函数,转换成每一帧时间内的频谱。
在其中一个实施例中,终端解析待验证语音信息,得到相应的声波信号,将该声波信号分帧,并将分帧后的声波信号与窗函数相乘后得到的信号进行傅立叶变换,得到相应的频谱。 从频谱中提取单帧声纹特征,根据每一帧的单帧声纹特征生成该待验证语音信息的声纹特征,根据每一帧声波信号的声纹特征相应的状态号,确定每一帧声波信号的状态,并将确定的状态进行组合,得到相应的字符,根据得到的字符生成待验证文本。窗函数是对声波信号进行截断的函数。
本实施例中,通过将声波信号转换成频谱,可以获得待验证语音信息中更多的信息,从而获取更多的声纹特征,使得语音验证更加准确。
在其中一个实施例中,该方法还包括:采集当前的噪音信息;根据采集的噪音信息生成抗干扰模型;在解析得到声波信号后,通过该抗干扰模型将解析得到的声波信号修正后,执行该将该声波信号分帧,得到每一帧的声波信号的步骤。
噪音信号是对待验证语音信息造成干扰的声音信号。噪音信号具体可以是周围环境发出的声音,例如风声、雨声和读书声等中的至少一种。抗干扰模型具体是用于过滤待验证的声波信号中噪音信号的模型。通过抗干扰模型将解析得到的声波信号修正,具体可以是将抗干扰模型与解析得到的声波信号叠加,也可以是从解析得到的声波信号中滤去抗干扰模型。
本实施例中,通过采集当前的噪音信号,生成抗干扰模型,可以根据抗干扰模型修正声波信号,从而使得解析得到的声波信号更加的精准,提高了声纹验证的准确率。
在其中一个实施例中,该获取当前场景类型包括:获取采集该待验证语音信息的时间信息和/或地理位置信息;查询与该时间信息和/或地理位置信息相匹配的预设场景类型;将查询到的预设场景类型作为当前场景类型。
时间信息是采集待验证语音信息的时间。时间信息具体包括日期和日内时间点,日内时间点包括时、分和秒。地理位置信息是采集待验证语音信息所在的地理位置。地理位置信息具体包括城市标识和建筑标识,建筑标识具体可以是运动场、住宅、医院、公司、地铁站和马路等中的至少一种。
在其中一个实施例中,终端获取采集待验证语音信息的日内时间点,例如是早晨6点整,再获取终端当前所在的地理位置信息,例如是深圳南山区深圳湾公园,根据终端上的传感器获取到终端在获取到待验证语音信息之前的30分钟内都在移动,且保持匀速8千米每小时,则查询到预设场景类型为“户外慢跑”,则终端将“户外慢跑”作为当前场景类型。
在其中一个实施例中,终端获取到当前所在的地理位置信息,例如是在家中,则直接选取的预设场景类型为“家中”,并将“家中”作为当前场景类型。
在其中一个实施例中,终端检测到连接的WIFI(Wireless Fidelity,基于IEEE 802.11b标准的无线局域网)为预设的安全WIFI,则直接选取的预设场景类型为“安全位置”,并将“安全位置”作为当前场景类型。
本实施例中,通过获取采集待验证语音信息的时间信息和/或地理位置信息,查询匹配的预设场景类型,将查询到的预设场景类型作为当前场景类型,可以选取到相应的特征模型,从而使得待验证语音信息匹配的场景类型和特征模型匹配的场景类型一致,从而尽可能减小场景对待验证语音信息的影像,进而提高语音验证的返回率。
在其中一个实施例中,该获取当前场景类型包括:获取采集该待验证语音信息的时间信 息和地理位置信息;查找与该时间信息和该地理位置信息相匹配的天气信息;查询与该天气信息相匹配的预设场景类型;将查询到的预设场景类型作为当前场景类型。
天气信息是一个地区内天气现象的信息。天气信息具体包括气温、气压、湿度、风、云、雾、雨、闪、雪、霜、雷、雹、霾等。
在其中一个实施例中,终端获取采集待验证语音信息的日期和日内时间点,例如是12月18日下午3点整,再获取终端当前所在的地理位置信息,例如深圳市福田区平安大厦,根据获取的日期和地理位置信息在天气预报系统中查询当前的天气信息,例如多云、当前温度12摄氏度、东北风5级,以及对比12月17日下午3点整降温5摄氏度,则查询到的预设场景类型为“易感冒”。将查询到的“易感冒”作为当前场景类型。
本实施例中,通过获取采集待验证语音信息的时间信息和/或地理位置信息,查询匹配的天气信息,并查询与天气信息匹配的预设场景类型,将查询到的预设场景类型作为当前场景类型,可以选取到相应的特征模型,从而使得待验证语音信息匹配的场景类型和特征模型匹配的场景类型一致,从而尽可能减小场景对待验证语音信息的影像,进而提高语音验证的返回率。
在其中一个实施例中,该方法还包括:获取公共特征模型;获取与预设场景类型和该用户标识相对应的训练语音样本;根据该训练语音样本将该公共特征模型进行再训练,得到与该预设场景类型和该用户标识相匹配的特征模型。
公共特征模型是通用的特征模型。公共特征模型具体是同一种类型的声音所通用的特征模型,例如男声、童声或女声等。训练语音样本是训练特征模型所采集的语音信息。具体地,采集训练语音样本的时期在选取公共特征模型后一个月至三个月之间,具体时间取决于采集训练语音样本的频率。
在其中一个实施例中,服务器在模型库中选取与用户的声纹匹配的GMM-UBM(Gaussian Markov Model-Uniform Background Model,高斯混合模型—通用背景模型),在训练期内通过采集的训练语音样本,不断地训练GMM-UBM,将GMM-UBM训练成与用户的用户标识相匹配的特征模型。在服务器对GMM-UBM训练时,检测到训练语音样本的声纹特征与其它时间收集到的声纹特征变化较大,则获取终端的地理位置信息、时间信息和天气信息等场景信息,将获取到的场景信息标识为场景类型。
本实施例中,通过使用训练语音样本对公共特征模型进行再训练,可以快速地训练出特征模型,使得效率变高。
如图3所示,在其中一个实施例中,还提供了一种语音验证方法,该方法具体包括以下的步骤:
S302,终端获取身份验证指令。
S304,终端响应于该身份验证指令,获取用户标识。
S306,终端查询对应于该用户标识预配置的文本。
S308,当终端未查询到该文本时,随机生成文本。
S310,终端反馈随机生成的该文本。
S312,终端采集当前的噪音信息。
S314,终端采集与反馈的该文本相匹配的待验证语音信息。
S316,终端将采集的噪音信息和待验证语音信息反馈至服务器。
S318,服务器根据噪音信息生成抗干扰模型。
S320,服务器解析该待验证语音信息,得到相应的声波信号。
S322,服务器在解析得到声波信号后,通过该抗干扰模型将解析得到的声波信号修正后。
S324,服务器将该声波信号分帧,得到每一帧的声波信号。
S326,服务器对该每一帧的声波信号进行傅立叶变换,得到相应的频谱。
S328,服务器从该频谱中提取单帧声纹特征。
S330,服务器根据每一帧的单帧声纹特征生成该待验证语音信息的声纹特征。
S332,服务器将该声纹特征转化为待验证文本。
S334,终端获取采集该待验证语音信息的时间信息和地理位置信息。
S336,终端将时间信息和地理位置信息反馈至服务器后,服务器查找与该时间信息和该地理位置信息相匹配的天气信息。
S338,服务器查询与该天气信息相匹配的预设场景类型。
S340,服务器将查询到的预设场景类型作为当前场景类型。
S342,服务器查询与该当前场景类型匹配、且与该用户标识对应的特征模型。
S344,服务器通过该特征模型,将该待验证文本转换为参考声纹特征。
S346,服务器比较该待验证的声纹特征和该参考声纹特征,得到语音验证结果。
上述语音验证方法,在获取待验证语音信息和相应的用户标识后,从待验证语音信息中提取声纹特征和待验证文本。通过获取当前的场景类型,查询与当前场景类型匹配、且与用户标识对应的特征模型,由于待验证语音信息是在当前的场景类型相应的场景下获取到的,因此待验证语音信息与当前场景类型匹配,待验证的声纹特征也与当前场景类型匹配。通过特征模型将待验证文本转换为参考声纹特征,该参考声纹特征自然也与当前场景类型匹配。在参考声纹特征和待验证的声纹特征都与当前场景类型匹配时,通过比较该待验证的声纹特征和该参考声纹特征,得到的语音验证结果就可以准确地反映待验证语音信息是否是用户本人的语音信息,从而可以在用户声音发生变化时,也能够识别出用户本人的声音。而且在验证通过时,使用待验证的声纹特征对与当前场景类型匹配、且与用户标识对应的特征模型再训练并更新,也可以提高与这个场景类型相应的特征模型的有效性,进而提高语音验证的召回率。
应该理解的是,虽然图2和3的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2和3中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
在其中一个实施例中,如图4所示,提供了一种语音验证装置400,包括:信息获取模块402、信息提取模块404、类型获取模块406、模型查询模块408、特征转换模块410、特征比较模块412、再训练模块413和模型更新模块415,其中:信息获取模块402,用于获取待验证语音信息和相应的用户标识;信息提取模块404,用于从该待验证语音信息中提取待验证的声纹特征和待验证文本;类型获取模块406,用于获取当前场景类型;模型查询模块408,用于查询与该当前场景类型匹配、且与该用户标识对应的特征模型;特征转换模块410,用于通过该特征模型,将该待验证文本转换为参考声纹特征;特征比较模块412,用于比较该待验证的声纹特征和该参考声纹特征,得到语音验证结果;再训练模块413,用于当该语音验证结果表示验证通过时,则根据该待验证的声纹特征对该特征模型进行再训练;模型更新模块415,用于使用再训练后的特征模型更新与该当前场景类型匹配、且与该用户标识对应的特征模型。
上述语音验证装置400,在获取待验证语音信息和相应的用户标识后,从待验证语音信息中提取声纹特征和待验证文本。通过获取当前的场景类型,查询与当前场景类型匹配、且与用户标识对应的特征模型,由于待验证语音信息是在当前的场景类型相应的场景下获取到的,因此待验证语音信息与当前场景类型匹配,待验证的声纹特征也与当前场景类型匹配。通过特征模型将待验证文本转换为参考声纹特征,该参考声纹特征自然也与当前场景类型匹配。在参考声纹特征和待验证的声纹特征都与当前场景类型匹配时,通过比较该待验证的声纹特征和该参考声纹特征,得到的语音验证结果就可以准确地反映待验证语音信息是否是用户本人的语音信息,从而可以在用户声音发生变化时,也能够识别出用户本人的声音。而且在验证通过时,使用待验证的声纹特征对与当前场景类型匹配、且与用户标识对应的特征模型再训练并更新,也可以提高与这个场景类型相应的特征模型的有效性,进而提高语音验证的召回率。
如图5所示,在其中一个实施例中,信息获取模块402,包括:指令获取模块402a,用于获取身份验证指令;标识获取模块402b,用于响应于该身份验证指令,获取用户标识;文本查询模块402c,用于查询对应于该用户标识预配置的文本;文本生成模块402d,用于当未查询到该文本时,随机生成文本;文本反馈模块402e,用于反馈随机生成的该文本;信息采集模块402f,用于采集与反馈的该文本相匹配的待验证语音信息。
如图6所示,在其中一个实施例中,信息提取模块404,包括:信息解析模块404a,用于解析该待验证语音信息,得到相应的声波信号;信号分帧模块404b,用于将该声波信号分帧,得到每一帧的声波信号;信号变换模块404c,用于对该每一帧的声波信号进行傅立叶变换,得到相应的频谱;特征提取模块404d,用于从该频谱中提取单帧声纹特征;特征生成模块404e,用于根据每一帧的单帧声纹特征生成该待验证语音信息的声纹特征;文本转化模块404f,用于将该声纹特征转化为待验证文本。
在其中一个实施例中,信息获取模块402,还用于采集当前的噪音信息;信息提取模块404,还用于根据采集的噪音信息生成抗干扰模型;在解析得到声波信号后,通过该抗干扰模型将解析得到的声波信号修正后,执行该将该声波信号分帧,得到每一帧的声波信号的步骤。
如图7所示,在其中一个实施例中,类型获取模块406,包括:场景获取模块406a,用于获取采集该待验证语音信息的时间信息和/或地理位置信息;类型查询模块406b,用于查询与该时间信息和/或地理位置信息相匹配的预设场景类型;类型确定模块406c,用于将查询到的预设场景类型作为当前场景类型。
在其中一个实施例中,场景获取模块406a,还用于获取采集该待验证语音信息的时间信息和地理位置信息;上述类型获取模块406,还包括:天气获取模块406d,用于查找与该时间信息和该地理位置信息相匹配的天气信息;类型查询模块406b,还用于查询与该天气信息相匹配的预设场景类型;类型确定模块406c,还用于将查询到的预设场景类型作为当前场景类型。
如图8所示,在其中一个实施例中,上述语音验证装置400,还包括:模型获取模块414,用于获取公共特征模型;样本获取模块416,用于获取与预设场景类型和该用户标识相对应的训练语音样本;模型训练模块418,用于根据该训练语音样本将该公共特征模型进行再训练,得到与该预设场景类型和该用户标识相匹配的特征模型。
关于语音验证装置的具体限定可以参见上文中对于语音验证方法的限定,在此不再赘述。上述语音验证装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在其中一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图9所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时，实现一种语音验证方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图9中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在其中一个实施例中,提供了一种计算机设备,包括存储器和一个或多个处理器,存储器中储存有计算机可读指令,计算机可读指令被处理器执行时,使得一个或多个处理器实现本申请任意一个实施例中提供的语音验证方法的步骤。
上述计算机设备，在获取待验证语音信息和相应的用户标识后，从待验证语音信息中提取声纹特征和待验证文本。通过获取当前的场景类型，查询与当前场景类型匹配、且与用户标识对应的特征模型，由于待验证语音信息是在当前的场景类型相应的场景下获取到的，因此待验证语音信息与当前场景类型匹配，待验证的声纹特征也与当前场景类型匹配。通过特征模型将待验证文本转换为参考声纹特征，该参考声纹特征自然也与当前场景类型匹配。在参考声纹特征和待验证的声纹特征都与当前场景类型匹配时，通过比较该待验证的声纹特征和该参考声纹特征，得到的语音验证结果就可以准确地反映待验证语音信息是否是用户本人的语音信息，从而可以在用户声音发生变化时，也能够识别出用户本人的声音。而且在验证通过时，使用待验证的声纹特征对与当前场景类型匹配、且与用户标识对应的特征模型再训练并更新，也可以提高与这个场景类型相应的特征模型的有效性，进而提高语音验证的召回率。
在其中一个实施例中,提供了一个或多个存储有计算机可读指令的非易失性存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器实现本申请任意一个实施例中提供的语音验证方法的步骤。
上述计算机可读存储介质,在获取待验证语音信息和相应的用户标识后,从待验证语音信息中提取声纹特征和待验证文本。通过获取当前的场景类型,查询与当前场景类型匹配、且与用户标识对应的特征模型,由于待验证语音信息是在当前的场景类型相应的场景下获取到的,因此待验证语音信息与当前场景类型匹配,待验证的声纹特征也与当前场景类型匹配。通过特征模型将待验证文本转换为参考声纹特征,该参考声纹特征自然也与当前场景类型匹配。在参考声纹特征和待验证的声纹特征都与当前场景类型匹配时,通过比较该待验证的声纹特征和该参考声纹特征,得到的语音验证结果就可以准确地反映待验证语音信息是否是用户本人的语音信息,从而可以在用户声音发生变化时,也能够识别出用户本人的声音。而且在验证通过时,使用待验证的声纹特征对与当前场景类型匹配、且与用户标识对应的特征模型再训练并更新,也可以提高与这个场景类型相应的特征模型的有效性,进而提高语音验证的召回率。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种语音验证方法,包括:
    获取待验证语音信息和相应的用户标识;
    从所述待验证语音信息中提取待验证的声纹特征和待验证文本;
    获取当前场景类型;
    查询与所述当前场景类型匹配、且与所述用户标识对应的特征模型;
    通过所述特征模型,将所述待验证文本转换为参考声纹特征;
    比较所述待验证的声纹特征和所述参考声纹特征,得到语音验证结果;
    当所述语音验证结果表示验证通过时,则根据所述待验证的声纹特征对所述特征模型进行再训练;及
    使用再训练后的特征模型更新与所述当前场景类型匹配、且与所述用户标识对应的特征模型。
  2. 根据权利要求1所述的方法,其特征在于,所述获取待验证语音信息和相应的用户标识,包括:
    获取身份验证指令;
    响应于所述身份验证指令,获取用户标识;
    查询对应于所述用户标识预配置的文本;
    当未查询到所述文本时,随机生成文本;
    反馈随机生成的所述文本;及
    采集与反馈的所述文本相匹配的待验证语音信息。
  3. 根据权利要求1所述的方法,其特征在于,所述从所述待验证语音信息中提取待验证的声纹特征和待验证文本,包括:
    解析所述待验证语音信息,得到相应的声波信号;
    将所述声波信号分帧,得到每一帧的声波信号;
    对所述每一帧的声波信号进行傅立叶变换,得到相应的频谱;
    从所述频谱中提取单帧声纹特征;
    根据每一帧的单帧声纹特征生成所述待验证语音信息的声纹特征;及
    将所述声纹特征转化为待验证文本。
  4. 根据权利要求3所述的方法,其特征在于,还包括:
    采集当前的噪音信息;
    根据采集的噪音信息生成抗干扰模型;及
    在解析得到声波信号后,通过所述抗干扰模型将解析得到的声波信号修正后,执行所述将所述声波信号分帧,得到每一帧的声波信号的步骤。
  5. 根据权利要求1所述的方法,其特征在于,所述获取当前场景类型包括:
    获取采集所述待验证语音信息的时间信息和/或地理位置信息;
    查询与所述时间信息和/或地理位置信息相匹配的预设场景类型;及
    将查询到的预设场景类型作为当前场景类型。
  6. 根据权利要求1所述的方法,其特征在于,所述获取当前场景类型包括:
    获取采集所述待验证语音信息的时间信息和地理位置信息;
    查找与所述时间信息和所述地理位置信息相匹配的天气信息;
    查询与所述天气信息相匹配的预设场景类型;及
    将查询到的预设场景类型作为当前场景类型。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,还包括:
    获取公共特征模型;
    获取与预设场景类型和所述用户标识相对应的训练语音样本;及
    根据所述训练语音样本将所述公共特征模型进行再训练,得到与所述预设场景类型和所述用户标识相匹配的特征模型。
  8. 一种语音验证装置,包括:
    信息获取模块,用于获取待验证语音信息和相应的用户标识;
    信息提取模块,用于从所述待验证语音信息中提取待验证的声纹特征和待验证文本;
    类型获取模块,用于获取当前场景类型;
    模型查询模块,用于查询与所述当前场景类型匹配、且与所述用户标识对应的特征模型;
    特征转换模块,用于通过所述特征模型,将所述待验证文本转换为参考声纹特征;
    特征比较模块，用于比较所述待验证的声纹特征和所述参考声纹特征，得到语音验证结果；
    再训练模块,用于当所述语音验证结果表示验证通过时,则根据所述待验证的声纹特征对所述特征模型进行再训练;及
    模型更新模块,用于使用再训练后的特征模型更新与所述当前场景类型匹配、且与所述用户标识对应的特征模型。
  9. 根据权利要求8所述的装置，其特征在于，所述信息获取模块，包括：
    指令获取模块,用于获取身份验证指令;
    标识获取模块,用于响应于该身份验证指令,获取用户标识;
    文本查询模块,用于查询对应于该用户标识预配置的文本;
    文本生成模块,用于当未查询到该文本时,随机生成文本;
    文本反馈模块,用于反馈随机生成的该文本;
    信息采集模块,用于采集与反馈的该文本相匹配的待验证语音信息。
  10. 根据权利要求8所述的装置，其特征在于，所述信息提取模块，包括：
    信息解析模块,用于解析该待验证语音信息,得到相应的声波信号;
    信号分帧模块,用于将该声波信号分帧,得到每一帧的声波信号;
    信号变换模块,用于对该每一帧的声波信号进行傅立叶变换,得到相应的频谱;
    特征提取模块,用于从该频谱中提取单帧声纹特征;
    特征生成模块,用于根据每一帧的单帧声纹特征生成该待验证语音信息的声纹特征;
    文本转化模块,用于将该声纹特征转化为待验证文本。
  11. 一种计算机设备,包括存储器及一个或多个处理器,所述存储器中储存有计算机可读指令,所述计算机可读指令被所述一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    获取待验证语音信息和相应的用户标识;
    从所述待验证语音信息中提取待验证的声纹特征和待验证文本;
    获取当前场景类型;
    查询与所述当前场景类型匹配、且与所述用户标识对应的特征模型;
    通过所述特征模型,将所述待验证文本转换为参考声纹特征;
    比较所述待验证的声纹特征和所述参考声纹特征,得到语音验证结果;
    当所述语音验证结果表示验证通过时,则根据所述待验证的声纹特征对所述特征模型进行再训练;及
    使用再训练后的特征模型更新与所述当前场景类型匹配、且与所述用户标识对应的特征模型。
  12. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    获取身份验证指令;
    响应于所述身份验证指令,获取用户标识;
    查询对应于所述用户标识预配置的文本;
    当未查询到所述文本时,随机生成文本;
    反馈随机生成的所述文本;及
    采集与反馈的所述文本相匹配的待验证语音信息。
  13. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    解析所述待验证语音信息,得到相应的声波信号;
    将所述声波信号分帧,得到每一帧的声波信号;
    对所述每一帧的声波信号进行傅立叶变换,得到相应的频谱;
    从所述频谱中提取单帧声纹特征;
    根据每一帧的单帧声纹特征生成所述待验证语音信息的声纹特征;及
    将所述声纹特征转化为待验证文本。
  14. 根据权利要求13所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    采集当前的噪音信息;
    根据采集的噪音信息生成抗干扰模型;及
    在解析得到声波信号后,通过所述抗干扰模型将解析得到的声波信号修正后,执行所述将所述声波信号分帧,得到每一帧的声波信号的步骤。
  15. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    获取采集所述待验证语音信息的时间信息和/或地理位置信息;
    查询与所述时间信息和/或地理位置信息相匹配的预设场景类型;及
    将查询到的预设场景类型作为当前场景类型。
  16. 一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行以下步骤:
    获取待验证语音信息和相应的用户标识;
    从所述待验证语音信息中提取待验证的声纹特征和待验证文本;
    获取当前场景类型;
    查询与所述当前场景类型匹配、且与所述用户标识对应的特征模型;
    通过所述特征模型,将所述待验证文本转换为参考声纹特征;
    比较所述待验证的声纹特征和所述参考声纹特征,得到语音验证结果;
    当所述语音验证结果表示验证通过时,则根据所述待验证的声纹特征对所述特征模型进行再训练;及
    使用再训练后的特征模型更新与所述当前场景类型匹配、且与所述用户标识对应的特征模型。
  17. 根据权利要求16所述的存储介质,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    获取身份验证指令;
    响应于所述身份验证指令,获取用户标识;
    查询对应于所述用户标识预配置的文本;
    当未查询到所述文本时,随机生成文本;
    反馈随机生成的所述文本;及
    采集与反馈的所述文本相匹配的待验证语音信息。
  18. 根据权利要求16所述的存储介质,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    解析所述待验证语音信息,得到相应的声波信号;
    将所述声波信号分帧,得到每一帧的声波信号;
    对所述每一帧的声波信号进行傅立叶变换,得到相应的频谱;
    从所述频谱中提取单帧声纹特征;
    根据每一帧的单帧声纹特征生成所述待验证语音信息的声纹特征;及
    将所述声纹特征转化为待验证文本。
  19. 根据权利要求18所述的存储介质,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    采集当前的噪音信息;
    根据采集的噪音信息生成抗干扰模型;及
    在解析得到声波信号后,通过所述抗干扰模型将解析得到的声波信号修正后,执行所述将所述声波信号分帧,得到每一帧的声波信号的步骤。
  20. 根据权利要求16所述的存储介质,其特征在于,所述处理器执行所述计算机可读指令时还执行以下步骤:
    获取采集所述待验证语音信息的时间信息和/或地理位置信息;
    查询与所述时间信息和/或地理位置信息相匹配的预设场景类型;及
    将查询到的预设场景类型作为当前场景类型。
PCT/CN2018/088696 2018-01-16 2018-05-28 语音验证方法、装置、计算机设备和计算机可读存储介质 WO2019140823A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810041764.3 2018-01-16
CN201810041764.3A CN108305633B (zh) 2018-01-16 2018-01-16 语音验证方法、装置、计算机设备和计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2019140823A1 true WO2019140823A1 (zh) 2019-07-25

Family

ID=62869165

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/088696 WO2019140823A1 (zh) 2018-01-16 2018-05-28 语音验证方法、装置、计算机设备和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN108305633B (zh)
WO (1) WO2019140823A1 (zh)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109273009A (zh) * 2018-08-02 2019-01-25 平安科技(深圳)有限公司 门禁控制方法、装置、计算机设备和存储介质
CN108989349B (zh) * 2018-08-31 2022-11-29 平安科技(深圳)有限公司 用户账号解锁方法、装置、计算机设备及存储介质
CN109450850B (zh) * 2018-09-26 2022-10-11 深圳壹账通智能科技有限公司 身份验证方法、装置、计算机设备和存储介质
CN109446774B (zh) * 2018-09-30 2021-11-30 山东知味行网络科技有限公司 一种身份识别应用方法及系统
CN109147797B (zh) * 2018-10-18 2024-05-07 平安科技(深圳)有限公司 基于声纹识别的客服方法、装置、计算机设备及存储介质
CN109410938A (zh) * 2018-11-28 2019-03-01 途客电力科技(天津)有限公司 车辆控制方法、装置及车载终端
CN111292739B (zh) * 2018-12-10 2023-03-31 珠海格力电器股份有限公司 一种语音控制方法、装置、存储介质及空调
CN111312233A (zh) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 一种语音数据的识别方法、装置及系统
CN109410956B (zh) * 2018-12-24 2021-10-08 科大讯飞股份有限公司 一种音频数据的对象识别方法、装置、设备及存储介质
CN111445904A (zh) * 2018-12-27 2020-07-24 北京奇虎科技有限公司 基于云端的语音控制方法、装置及电子设备
CN112289325A (zh) * 2019-07-24 2021-01-29 华为技术有限公司 一种声纹识别方法及装置
CN110827799B (zh) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 用于处理语音信号的方法、装置、设备和介质
CN111415669B (zh) * 2020-04-15 2023-03-31 厦门快商通科技股份有限公司 一种声纹模型构建方法和装置以及设备
CN111653283B (zh) * 2020-06-28 2024-03-01 讯飞智元信息科技有限公司 一种跨场景声纹比对方法、装置、设备及存储介质
CN111795707A (zh) * 2020-07-21 2020-10-20 高超群 一种新能源汽车充电桩路线规划方法
CN111916053B (zh) * 2020-08-17 2022-05-20 北京字节跳动网络技术有限公司 语音生成方法、装置、设备和计算机可读介质
CN112447167A (zh) * 2020-11-17 2021-03-05 康键信息技术(深圳)有限公司 语音识别模型验证方法、装置、计算机设备和存储介质
CN112599137A (zh) * 2020-12-16 2021-04-02 康键信息技术(深圳)有限公司 验证声纹模型识别效果的方法、装置和计算机设备
CN112669820B (zh) * 2020-12-16 2023-08-04 平安科技(深圳)有限公司 基于语音识别的考试作弊识别方法、装置及计算机设备
CN112992174A (zh) * 2021-02-03 2021-06-18 深圳壹秘科技有限公司 一种语音分析方法及其语音记录装置
CN113066501A (zh) * 2021-03-15 2021-07-02 Oppo广东移动通信有限公司 语音启动终端的方法及装置、介质和电子设备
CN112992153B (zh) * 2021-04-27 2021-08-17 太平金融科技服务(上海)有限公司 音频处理方法、声纹识别方法、装置、计算机设备
CN113254897B (zh) * 2021-05-13 2024-01-05 北京达佳互联信息技术有限公司 信息验证方法、装置、服务器及存储介质
KR20240041318A (ko) * 2021-07-27 2024-03-29 퀄컴 인코포레이티드 콘텍스트 정보 및 사용자 감정을 사용한 음성 또는 스피치 인식

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130060569A1 (en) * 2005-07-27 2013-03-07 International Business Machines Corporation Voice authentication system and method using a removable voice id card
CN104104664A (zh) * 2013-04-11 2014-10-15 腾讯科技(深圳)有限公司 对验证码进行验证的方法、服务器、客户端和系统
CN104331652A (zh) * 2014-10-08 2015-02-04 无锡指网生物识别科技有限公司 指纹和语音识别的电子设备的动态密码生成方法
US20170140760A1 (en) * 2015-11-18 2017-05-18 Uniphore Software Systems Adaptive voice authentication system and method
CN106782569A (zh) * 2016-12-06 2017-05-31 深圳增强现实技术有限公司 一种基于声纹注册的增强现实方法及装置
CN107424613A (zh) * 2017-05-16 2017-12-01 鄂尔多斯市普渡科技有限公司 一种无人驾驶出租车的语音开门认证系统及其方法
CN107516526A (zh) * 2017-08-25 2017-12-26 百度在线网络技术(北京)有限公司 一种声源跟踪定位方法、装置、设备和计算机可读存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1280137B1 (en) * 2001-07-24 2004-12-29 Sony International (Europe) GmbH Method for speaker identification
CN102708867A (zh) * 2012-05-30 2012-10-03 北京正鹰科技有限责任公司 一种基于声纹和语音的防录音假冒身份识别方法及系统
CN105635087B (zh) * 2014-11-20 2019-09-20 阿里巴巴集团控股有限公司 通过声纹验证用户身份的方法及装置
CN106356057A (zh) * 2016-08-24 2017-01-25 安徽咪鼠科技有限公司 一种基于计算机应用场景语义理解的语音识别系统
CN107481720B (zh) * 2017-06-30 2021-03-19 百度在线网络技术(北京)有限公司 一种显式声纹识别方法及装置

Also Published As

Publication number Publication date
CN108305633A (zh) 2018-07-20
CN108305633B (zh) 2019-03-29

Similar Documents

Publication Publication Date Title
WO2019140823A1 (zh) 语音验证方法、装置、计算机设备和计算机可读存储介质
US10204619B2 (en) Speech recognition using associative mapping
WO2021208287A1 (zh) 用于情绪识别的语音端点检测方法、装置、电子设备及存储介质
US10593336B2 (en) Machine learning for authenticating voice
US9548048B1 (en) On-the-fly speech learning and computer model generation using audio-visual synchronization
CN107928673B (zh) 音频信号处理方法、装置、存储介质和计算机设备
CN110364143B (zh) 语音唤醒方法、装置及其智能电子设备
US11482242B2 (en) Audio recognition method, device and server
US20180277103A1 (en) Constructing speech decoding network for numeric speech recognition
EP3255631B1 (en) Dynamic password voice based identity authentication system and method having self-learning function
CN107731233B (zh) 一种基于rnn的声纹识别方法
WO2020043123A1 (zh) 命名实体识别方法、命名实体识别装置、设备及介质
US20210090561A1 (en) Alexa roaming authentication techniques
US20120102066A1 (en) Method, Devices and a Service for Searching
US7515770B2 (en) Information processing method and apparatus
CN110428854B (zh) 车载端的语音端点检测方法、装置和计算机设备
CN109377981B (zh) 音素对齐的方法及装置
CN110556126A (zh) 语音识别方法、装置以及计算机设备
CN106782508A (zh) 语音音频的切分方法和语音音频的切分装置
CN109448732B (zh) 一种数字串语音处理方法及装置
CN104732972A (zh) 一种基于分组统计的hmm声纹识别签到方法及系统
CN109947971A (zh) 图像检索方法、装置、电子设备及存储介质
Ghaemmaghami et al. Speaker attribution of australian broadcast news data
CN111737515B (zh) 音频指纹提取方法、装置、计算机设备和可读存储介质
CN110838294B (zh) 一种语音验证方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18901260

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18901260

Country of ref document: EP

Kind code of ref document: A1