WO2019214047A1 - Method, Apparatus, Computer Device and Storage Medium for Establishing a Voiceprint Model - Google Patents

Method, Apparatus, Computer Device and Storage Medium for Establishing a Voiceprint Model

Info

Publication number
WO2019214047A1
WO2019214047A1 (PCT/CN2018/094888; CN2018094888W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
voiceprint
speech
target user
input
Prior art date
Application number
PCT/CN2018/094888
Other languages
English (en)
French (fr)
Inventor
蔡元哲
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Priority to JP2019570559A priority Critical patent/JP6906067B2/ja
Priority to US16/759,384 priority patent/US11322155B2/en
Priority to SG11202002083WA priority patent/SG11202002083WA/en
Publication of WO2019214047A1 publication Critical patent/WO2019214047A1/zh


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/20Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method, an apparatus, a computer device, and a storage medium for establishing a voiceprint model.
  • A voiceprint is the spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. Modern research shows that voiceprints are not only specific to the speaker but also relatively stable: after adulthood, a person's voice remains relatively stable for a long time.
  • a voiceprint recognition algorithm builds a recognition model by learning various voice features extracted from the sound spectrum, and thereby confirms the speaker's identity.
  • current voiceprint recognition methods work well for long speech texts (speaker audio longer than 1 minute), but for short speech texts (speaker audio shorter than 1 minute, for example around 20 seconds) the recognition error rate is still relatively high.
  • the main object of the present application is to provide a method, apparatus, computer device and storage medium for establishing a voiceprint model that reduces the recognition error rate of short sound text.
  • the present application provides a method for establishing a voiceprint model, including:
  • the present application also provides an apparatus for establishing a voiceprint model, including:
  • an extraction module configured to frame a voice signal of the input target user, and separately extract a voice acoustic feature of the framed voice signal;
  • a cluster structure module configured to input a plurality of the voice acoustic features into a deep learning model based on neural network training, to synthesize at least one cluster structure
  • a calculation module configured to calculate an average value and a standard deviation of at least one of the cluster structures
  • a feature vector module configured to perform coordinate transformation and activation function calculation on the average value and the standard deviation to obtain a feature vector parameter
  • a model module configured to input the feature vector parameter and the identity verification result of the target user into a preset basic model, to obtain a voiceprint model corresponding to the target user, where the voiceprint model is used to verify whether an input voice signal belongs to the target user.
  • the present application also provides a computer device comprising a memory and a processor, the memory storing computer readable instructions, and the processor executing the computer readable instructions to implement the steps of any of the methods described above.
  • the present application also provides a computer non-transitory readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of any of the methods described above.
  • with the method, apparatus, computer device and storage medium for establishing a voiceprint model of the present application, the extracted speech acoustic features are aggregated into cluster structures by deep-neural-network-based training, and the cluster structures then undergo coordinate mapping and activation function calculation.
  • the resulting voiceprint model can reduce the voice recognition error rate of the voiceprint model.
  • FIG. 1 is a schematic flow chart of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a step S2 of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a step S22 of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a step S5 of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 5 is a schematic flowchart of a step S1 of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 6 is a schematic flowchart of a step S11 of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 7 is a schematic flow chart of a method for establishing a voiceprint model according to an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a step S1 of a method for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 9 is a schematic structural diagram of an apparatus for establishing a voiceprint model according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a cluster structure module of an apparatus for establishing a voiceprint model according to an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a model module of an apparatus for establishing a voiceprint model according to an embodiment of the present application;
  • FIG. 12 is a schematic structural diagram of an extraction module of an apparatus for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 13 is a schematic structural diagram of an apparatus for establishing a voiceprint model according to an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of an extraction module of an apparatus for establishing a voiceprint model according to an embodiment of the present application
  • FIG. 15 is a schematic block diagram showing the structure of a computer device according to an embodiment of the present application.
  • an embodiment of the present application provides a method for establishing a voiceprint model, including the following steps:
  • S1: framing a voice signal of the input target user, and separately extracting speech acoustic features of the framed voice signal;
  • S2: inputting a plurality of the speech acoustic features into a deep learning model based on neural network training, and aggregating them into at least one cluster structure;
  • S3: calculating an average value and a standard deviation of at least one of the cluster structures;
  • S4: performing coordinate transformation and activation function calculation on the average value and the standard deviation to obtain a feature vector parameter;
  • S5: inputting the feature vector parameter and the identity verification result of the target user into a preset basic model, to obtain a voiceprint model corresponding to the target user, where the voiceprint model is used to verify whether an input voice signal belongs to the target user.
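  • As a rough illustration of steps S1-S5, the following Python sketch strings the stages together; the helper callables (frame_features, deep_model, base_model) and the use of a per-dimension standard deviation are assumptions for illustration, not part of the original disclosure.

```python
import numpy as np

def build_voiceprint_model(signal, sample_rate, identity_label,
                           frame_features, deep_model, base_model):
    """Minimal sketch of steps S1-S5; all helper callables are placeholders."""
    # S1: frame the signal and extract one acoustic feature vector per frame
    features = frame_features(signal, sample_rate)   # shape: (n_frames, n_dims)

    # S2: feed the features through the deep learning model to obtain cluster structures
    clusters = deep_model(features)                  # shape: (n_clusters, p)

    # S3: mean and (per-dimension) standard deviation over the cluster structures
    mean = clusters.mean(axis=0)
    std = clusters.std(axis=0)

    # S4: coordinate transform (a-level) and activation (b-level) give the feature vector
    stats = np.concatenate([mean, std])
    feature_vector = 1.0 / (1.0 + np.exp(-stats))    # sigmoid as the activation

    # S5: train/update the preset base model with the feature vector and identity label
    return base_model(feature_vector, identity_label)
```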
  • the voiceprint is a sound wave spectrum carrying speech information displayed by an electroacoustic instrument.
  • the generation of human language is a complex physiological and physical process between the human language center and the vocal organs.
  • the vocal organs used in speech (tongue, teeth, larynx, lungs, nasal cavity) differ greatly in size and shape from person to person, so the voiceprints of any two people are different.
  • a speech signal is an analog signal carrying specific information, and it originates from the sound signal emitted by a person and converted into a speech signal.
  • since every person's voiceprint is different, the speech signals converted from the same words spoken by different people are also different, and therefore the speech acoustic features contained in the speech signals are also different.
  • Speech acoustic features are the voiceprint information contained in the sound produced by each person. Framing refers to dividing a continuous speech signal into multiple segments. At a normal speaking rate, the duration of a phoneme is about 50-200 milliseconds, so the frame length is generally taken to be less than 50 milliseconds. Microscopically, a frame must also include enough vibration periods: the fundamental frequency of a male voice is around 100 Hz and that of a female voice around 200 Hz, corresponding to periods of 10 ms and 5 ms. A frame should generally contain multiple periods, so it is usually at least 20 milliseconds long.
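  • A minimal framing sketch in Python, assuming a mono signal sampled at a known rate; the 25 ms frame length and 10 ms hop are illustrative values within the 20-50 ms range discussed above.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping frames (20-50 ms is typical)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

# e.g. 16 kHz audio with 25 ms frames -> 400 samples per frame:
# frames = frame_signal(audio, 16000)
```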
  • the so-called speech signal includes a continuous speech, such as a sentence, a paragraph, and the like.
  • the speech acoustic feature may be a Mel Frequency Cepstral Coefficient (MFCC) of the speech segment, or a Perceptual Linear Prediction Coefficient (PLP), or a Filter Bank Feature, or the like.
  • the speech acoustic feature can also be the original speech data of the speech segment.
  • extracting the speech acoustic features from the voice signal of the target user means extracting the speech of the person for whom the voiceprint model is to be established; speech signals produced by non-target users are not extracted.
  • a speech acoustic feature is extracted from a continuous speech signal and contains the spoken portion of that signal, so it is itself a continuous speech signal. After the speech signal is framed, multiple speech segments are obtained, and the speech acoustic features of each segment are extracted separately, yielding a plurality of speech acoustic features.
  • each speech acoustic feature is extracted from a framed speech signal and is itself a segment of speech signal; these segments are input into the neural-network-trained model in order to aggregate the speech acoustic features, which makes it convenient to collect statistics on and compute with them.
  • a cluster structure is the result of aggregating one or more speech acoustic features, and it reflects the common characteristics shared by the aggregated speech acoustic features.
  • after the plurality of speech acoustic features are input into the neural-network-based deep learning model, at least one cluster structure x1, x2, ..., xn is output; assuming each cluster structure is a p-dimensional vector, xi = (xi1, xi2, ..., xip)^T (i = 1, 2, ..., n). The average value is obtained by first computing the mean of each component and then assembling the p-dimensional mean vector, giving E(x); the standard deviation of the cluster structures is computed as D(x) = E{[x − E(x)][x − E(x)]^T}.
  • the above E(x) and D(x) are then subjected to an a-level mapping and a b-level mapping.
  • the a-level mapping is a coordinate transformation of the average value and the standard deviation of the cluster structures;
  • the b-level mapping passes the average value and the standard deviation of the cluster structures through an activation function to obtain a nonlinear result;
  • this result is the feature vector parameter used to establish the voiceprint model.
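  • A small numpy sketch of the statistics and the two mappings described above; the affine weight and bias of the a-level coordinate transform are hypothetical placeholders, and the per-dimension standard deviation is taken from the diagonal of D(x) for simplicity.

```python
import numpy as np

def cluster_statistics(clusters):
    """clusters: (n, p) array, one p-dimensional cluster structure per row."""
    mean = clusters.mean(axis=0)                      # E(x)
    centered = clusters - mean
    cov = centered.T @ centered / clusters.shape[0]   # D(x) = E{[x-E(x)][x-E(x)]^T}
    return mean, cov

def feature_vector(mean, cov, weight, bias):
    """a-level mapping (affine coordinate transform) followed by the
    b-level mapping (sigmoid activation); weight/bias are illustrative."""
    stats = np.concatenate([mean, np.sqrt(np.clip(np.diag(cov), 0, None))])
    mapped = weight @ stats + bias                    # a-level: coordinate transform
    return 1.0 / (1.0 + np.exp(-mapped))              # b-level: activation
```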
  • the system inputs the feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model of the target user; after receiving a voice signal, the voiceprint model determines whether the person who produced the voice signal is the target user.
  • the basic model refers to a neural network model, such as a BP neural network model.
  • a BP neural network is a multi-layer network that trains the weights of nonlinear differentiable functions. Its most notable characteristic is that, using only sample data and without a mathematical model of the system, it can realize a highly nonlinear mapping from the p^m space formed by the pattern vectors p of the m input neurons to the y^n space, where n is the number of output nodes.
  • the activation function of the b-level map can be Sigmoid.
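  • The Sigmoid activation used for the b-level mapping, as given in the original description (x is the input speech acoustic feature and e ≈ 2.71828):

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```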
  • the deep learning model includes multiple model layers; the step of inputting a plurality of the speech acoustic features into the deep learning model based on neural network training and aggregating them into at least one cluster structure includes:
  • S21: inputting a plurality of the speech acoustic features into the deep learning model;
  • S22: selecting any time node t among the plurality of speech acoustic features, and establishing an n-th model layer from the speech acoustic features within every tn interval from the time node t, where n is a positive integer;
  • S23: selecting a target model layer among the multiple model layers, and acquiring at least one cluster structure generated on the target model layer.
  • a plurality of speech acoustic features are extracted from a continuous speech signal, and thus the plurality of speech acoustic features are also continuous.
  • a plurality of said speech acoustic features are input into the deep learning model, they are also input in chronological order.
  • the plurality of speech acoustic features are each a continuous sound signal, and together they also form a continuous sound signal. Within this signal, any time node t is selected, and the speech acoustic features within a time span tn from t are aggregated to form a cluster structure on one of the model layers. Since the deep learning model has multiple model layers, and the time node t and the time span tn chosen on each model layer differ, the number of cluster structures generated by each model layer is not exactly the same.
  • for example, the plurality of speech acoustic features span a total of 10 seconds, that is, 10,000 ms, and the selected time node is the 2000th ms.
  • the first model layer is established with an interval of t1 (1 ms), so the first model layer has 10,000 frames in total.
  • the second model layer is then established, taking t2 as 2 ms; a frame of the second model layer is established every 2 ms, so the second model layer has 5,000 frames.
  • in step S23, after learning through the deep learning model, multiple model layers are obtained, each containing multiple cluster structures; the system selects one of the model layers as the target model layer, and the cluster structures on the target model layer serve as parameters for the subsequent generation of the voiceprint model.
  • in a specific embodiment, five model layers are established, and step S22 includes sub-steps S221 to S225, one for each model layer, where t1 < t2 < t3 < t4 < t5.
  • in step S221, any time node t is selected; for example, the speech acoustic features span 10 seconds, i.e. 10,000 ms, the selected time node is the 2000th ms, and the first model layer is established with an interval of t1 (1 ms), so the first model layer has 10,000 frames.
  • step S222 based on the first model layer, the selection time node is still 2000 ms, and the second model layer is established every time t2 (2 ms), and the second model layer has 5000 frames.
  • step S223 based on the second model layer, the selection time node is still 2000 ms, and the third model layer is established every t3 (3 ms), and the third model layer has 3334 frames.
  • step S224 based on the third model layer, the selection time node is still 2000 ms, and the fourth model layer is established every t4 (4 ms), and the fourth model layer has 2500 frames.
  • in step S225, based on the fourth model layer, the selected time node is still the 2000th ms, and the fifth model layer is established every t5 (8 ms), so the fifth model layer has 1,250 frames. Finally, the 1,250 frames on the fifth model layer are aggregated into cluster structures; after the five-layer deep learning model, 1,250 cluster structures are finally obtained.
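  • The frame counts of the five-layer example can be checked with a short calculation (assuming each layer covers the same 10,000 ms of features and rounds partial intervals up):

```python
import math

def layer_frame_counts(total_ms=10000, strides_ms=(1, 2, 3, 4, 8)):
    """Reproduce the frame counts of the five model layers in the example:
    10 s of features sampled every t1..t5 milliseconds."""
    return [math.ceil(total_ms / t) for t in strides_ms]

print(layer_frame_counts())   # [10000, 5000, 3334, 2500, 1250]
```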
  • the step of inputting the feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model corresponding to the target user includes:
  • S51: performing dimensionality reduction on the feature vector parameter of the voiceprint model;
  • S52: inputting the reduced-dimension feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model.
  • the system uses a linear discriminant analysis (LDA) based on probability to perform dimensionality reduction.
  • the model design of the target user's voiceprint is then performed.
  • the output layer adopts the Softmax function to compute its result. All nodes are initialized with uniform random weights in the interval [-0.05, 0.05], the initial bias is 0, and the final voiceprint model is obtained.
  • the input to the softmax function is a vector, and its output is also a vector. Each element in the vector is a probability value between 0 and 1.
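  • A minimal sketch of the output layer described above: uniform random weights in [-0.05, 0.05], zero initial bias, and a Softmax that turns the layer output into probabilities; the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, low=-0.05, high=0.05):
    """Uniform random weights in [-0.05, 0.05], biases initialised to 0."""
    return rng.uniform(low, high, size=(n_out, n_in)), np.zeros(n_out)

def softmax(z):
    """Output layer: probabilities between 0 and 1 that sum to 1."""
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

W, b = init_layer(n_in=512, n_out=2)   # e.g. "target user" vs. "not target user"
# probs = softmax(W @ feature_vector + b)
```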
  • when training the model, the bias reflects the degree to which the labels predicted on the training set in each round of training deviate from the original true labels. If this deviation is too small, over-fitting occurs, because the noise in the training set may also have been learned.
  • the bias therefore characterizes the fitting ability of the learning algorithm itself: if the fitting ability is poor, the bias is large and under-fitting occurs; conversely, if the fitting ability is too good, the bias is small and over-fitting easily occurs. During training it can be observed that this bias should gradually decrease, indicating that the model is continually learning useful information.
  • the step of extracting the speech acoustic features of the framed speech signal includes:
  • S11: performing fast Fourier transform calculation on the framed speech signal to obtain an energy spectrum;
  • S12: inputting the energy spectrum into a Mel-scale triangular filter bank and outputting formant features;
  • S13: performing a discrete cosine transform on the formant features to obtain the speech acoustic features.
  • step S11 the effective speech signal extracted after the frame is subjected to fast Fourier transform, and the speech signal in the time domain is converted into the energy spectrum in the frequency domain.
  • Fast Fourier Transform is a fast algorithm for discrete Fourier transform. It is based on the odd, even, imaginary, and real properties of discrete Fourier transforms.
  • the formant is an important feature reflecting the resonance characteristics of the channel, which represents the most direct source of the pronunciation information, and the person utilizes the formant information in the speech perception. Therefore, the formant is a very important feature parameter in speech signal processing and has been widely used as the main feature of speech recognition and basic information of speech coding transmission.
  • the formant information is included in the frequency envelope, so the key to the formant parameter extraction is to estimate the natural speech spectral envelope. It is generally considered that the maximum value in the spectral envelope is the formant.
  • the energy spectrum is then input into the Mel-scale triangular filter bank to calculate the logarithmic energy output by each filter.
  • the characteristics of the filter bank output are also called Filter Bank (FBANK) features.
  • filtering with the Mel-scale filter bank is used because the frequency-domain signal contains a lot of redundancy; the filter bank condenses the frequency-domain amplitudes so that each frequency band is represented by a single value. Concretely, the spectrum obtained by the fast Fourier transform is multiplied and accumulated with each filter, and the resulting value is the energy of that frame in the frequency band of that filter.
  • since the human ear's perception of sound is not linear, it is better described by the nonlinear log relationship, and cepstral analysis can only be performed after taking the logarithm; the energy values are therefore converted to logarithmic energy. The logarithmic energy is then passed through a discrete cosine transform to obtain the MFCC coefficients (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features.
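  • A compact numpy/scipy sketch of the S11-S13 pipeline (FFT energy spectrum, Mel triangular filter bank, log energy, DCT); the filter count, FFT size, and number of kept coefficients are common illustrative choices, not values taken from the original text.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, sample_rate, n_filters=26, n_ceps=13, n_fft=512):
    """frames: (n_frames, frame_len). FFT -> Mel triangular filter bank
    -> log energy -> DCT, following steps S11-S13."""
    # S11: power (energy) spectrum via the fast Fourier transform
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # S12: Mel-scale triangular filter bank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)      # FBANK (log filter bank) features

    # S13: discrete cosine transform gives the MFCC acoustic features
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```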
  • after the step of obtaining the voiceprint model described above, the method includes:
  • S6: inputting a voice signal to be verified into the voiceprint model to obtain the identity verification result output by the voiceprint model.
  • after the voiceprint model is established, it has a port for receiving voice signals. After receiving a voice signal, the voiceprint model evaluates it: if it is the target user's voice signal, a target-correct signal is output; if it is not the target user's voice signal, a target-error signal is output.
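  • A hedged sketch of how such a verification port might be used, assuming the trained model exposes a callable that scores an input signal:

```python
def verify(voiceprint_model, speech_signal, threshold=0.5):
    """Feed a speech signal to the trained voiceprint model and report
    whether it is judged to come from the target user (threshold is illustrative)."""
    score = voiceprint_model(speech_signal)   # assumed probability of "target user"
    return "target correct" if score >= threshold else "target error"
```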
  • the step of performing fast Fourier transform calculation on the framed speech signal includes:
  • S111: pre-emphasizing the framed speech signal;
  • S112: windowing the pre-emphasized speech signal;
  • S113: extracting, by voice endpoint detection, the valid part of the speech signal that contains speech;
  • S114: performing fast Fourier transform calculation on the valid part of the speech signal.
  • in step S111, the speech signal is pre-emphasized. The speech signal also contains some background noise; if the speech signal were used directly for voiceprint modeling, the noise would affect the result, the established model would be inaccurate, and the recognition error rate would increase.
  • extracting the valid speech directly is achieved by voice endpoint detection, that is, identifying at which moment in the recording the person starts speaking and at which moment the person stops speaking.
  • the main principle of voice endpoint detection is that the spectrum of audio containing human speech is higher than the spectrum of audio that does not contain human speech. Therefore, before the valid speech is extracted, the speech signal is pre-emphasized, i.e. amplified, so that the spectrum of the part containing human speech is higher and the difference between the two is more obvious, which makes voice endpoint detection more reliable.
  • one of the goals that the speech signal processing often achieves is to clarify the distribution of the respective frequency components in the speech.
  • the mathematical tool for doing this is the Fourier transform.
  • the Fourier transform requires the input signal to be stationary.
  • speech is not stationary at the macroscopic level, but microscopically a short segment of the speech signal can be regarded as stationary and can be cut out for the Fourier transform.
  • the purpose of windowing is to let the amplitude of a frame of signals fade to zero at both ends.
  • tapering to 0 is good for the Fourier transform, as it improves the resolution of the transform result (i.e. the spectrum).
  • step S113 since the voice signal further contains some noise and noise, if the voice signal is directly subjected to the voiceprint modeling process, some noises and noises are obtained, and the established model is inaccurate. Directly lead to increased recognition error rate.
  • Direct extraction of effective speech is achieved by means of speech endpoint detection, that is, identifying from which point in the speech is the person starting to talk, and which moment is the end of the speech. Through endpoint detection, distinguish between speech and noise, and extract valid speech parts. People will also pause when they speak. The voice of the effective part is extracted, and the noise part when the person pauses when the person speaks is removed, and only the effective voice of the spoken part of the person is extracted.
  • the fast Fourier transform is a fast algorithm for the discrete Fourier transform, obtained by improving the discrete Fourier transform algorithm based on its odd, even, imaginary, and real properties.
  • in this way, the speech acoustic features of the speaker in a segment of speech can be calculated.
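  • A sketch of the S111-S114 preprocessing chain (pre-emphasis, Hamming window, a simple energy-based endpoint detection, FFT), reusing the frame_signal helper from the earlier framing sketch; the pre-emphasis coefficient and energy threshold are illustrative assumptions.

```python
import numpy as np

def preprocess(signal, sample_rate, alpha=0.97, energy_ratio=0.1):
    """S111-S114 sketch: pre-emphasis, framing with a Hamming window,
    a simple energy-based endpoint detection, then the FFT."""
    # S111: pre-emphasis boosts the high-frequency speech content
    emphasised = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # framing + S112: a Hamming window tapers each frame towards 0 at both ends
    frames = frame_signal(emphasised, sample_rate)      # helper from the earlier sketch
    frames = frames * np.hamming(frames.shape[1])

    # S113: keep only frames whose energy suggests someone is speaking
    energy = (frames ** 2).sum(axis=1)
    voiced = frames[energy > energy_ratio * energy.max()]

    # S114: fast Fourier transform of the valid (voiced) frames
    return np.fft.rfft(voiced, axis=1)
```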
  • further, after the step of obtaining the voiceprint model, the method includes:
  • S7: receiving attribute information marked by the user for the voiceprint model, where the attribute information includes the gender, age, and ethnicity of the target user.
  • in step S7 above, after the voiceprint model is established, the system receives the marks added by the user to the voiceprint model, labelling the personal information of the target user corresponding to the voiceprint model, including gender, age, ethnicity, height, weight, and so on. Voiceprint information is related to the vocal organs: the articulation organs include the vocal cords, soft palate, tongue, teeth, and lips, and the resonators include the pharyngeal cavity, the oral cavity, and the nasal cavity. People with similar vocal organs produce sounds with certain common or similar characteristics, so the voiceprint information of people with the same attribute information tends to be similar. After the voiceprint information of many people is collected, it is summarized to find relationships between voiceprint information and people.
  • the step of extracting the speech acoustic features of the framed speech signal includes:
  • S14: recognizing the voice content of the input framed speech signal;
  • S15: determining the articulation part of the voice content;
  • S16: splitting the speech signal according to the articulation part;
  • S17: separately extracting speech acoustic features from the split speech signals.
  • step S14 the voice content of the input framed speech signal is recognized, that is, the voice signal is recognized by the means of voice recognition, and the specific speech text information of the speaker is recognized.
  • step S15 determining the utterance part of the voice content is based on the voice content recognized in the above S14, reading the pinyin or the phonetic symbol of the voice content, and determining the utterance part according to the content of the pinyin or the phonetic symbol.
  • Commonly used main sounding parts are throat, tongue, nose, teeth and so on. For example, in Mandarin, the corresponding vocal part is determined according to different initials.
  • the table corresponding to the specific initial and the sounding part is as follows:
  • the utterance part of the voice signal is retrospectively checked, and the voice signal is split into multiple segments according to the utterance part corresponding to the voice signal, and each voice signal corresponds to one utterance Part.
  • for example, in a speech signal with a duration of 10 seconds, the voice content of seconds 0-2 contains initials of b, p, or m, and the voice content of seconds 3-5 contains initials of j, q, or x,
  • while the voice content of seconds 6-10 contains initials of d, t, n, or l; the speech signal is then split into three speech signals.
  • the first segment is the speech content of 0-2 seconds
  • the second segment is the speech content of the 3-5th second
  • the third segment is the speech content of the 6th-10th second.
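  • A sketch of the splitting idea, using a hypothetical initial-to-articulation-place mapping built only from the b/p/m, j/q/x, and d/t/n/l examples in the text:

```python
# Hypothetical grouping of Mandarin initials by main articulation place,
# based only on the examples given above (b/p/m, j/q/x, d/t/n/l).
ARTICULATION = {"b": "lips", "p": "lips", "m": "lips",
                "j": "tongue", "q": "tongue", "x": "tongue",
                "d": "teeth", "t": "teeth", "n": "teeth", "l": "teeth"}

def split_by_articulation(segments):
    """segments: list of (start_s, end_s, initial) triples obtained from
    speech recognition; merge consecutive segments that share a place."""
    pieces, current = [], None
    for start, end, initial in segments:
        place = ARTICULATION.get(initial)
        if current and current[2] == place:
            current = (current[0], end, place)
        else:
            if current:
                pieces.append(current)
            current = (start, end, place)
    if current:
        pieces.append(current)
    return pieces

# e.g. the 10-second example in the text splits into three pieces:
# split_by_articulation([(0, 2, "b"), (3, 5, "j"), (6, 10, "d")])
```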
  • in summary, in the method for establishing a voiceprint model of the present application, the extracted speech acoustic features are aggregated into cluster structures by deep-neural-network-based training, and the cluster structures then undergo coordinate mapping and activation function calculation.
  • the resulting voiceprint model can reduce the voice recognition error rate of the voiceprint model.
  • the present application further provides an apparatus for establishing a voiceprint model, including:
  • the extracting module 1 is configured to frame the voice signal of the input target user and separately extract the speech acoustic features of the framed voice signal;
  • the cluster structure module 2 is configured to input a plurality of the speech acoustic features into a deep learning model based on neural network training and aggregate them into at least one cluster structure;
  • a calculation module 3 configured to calculate an average value and a standard deviation of at least one of the cluster structures
  • the feature vector module 4 is configured to perform coordinate transformation and activation function calculation on the average value and the standard deviation to obtain a feature vector parameter
  • the model module 5 is configured to input the feature vector parameter and the identity verification result of the target user into a preset basic model to obtain a voiceprint model corresponding to the target user, where the voiceprint model is used to verify whether an input voice signal belongs to the target user.
  • the voiceprint in the extraction module 1 is a sound wave spectrum carrying speech information displayed by an electroacoustic instrument.
  • the generation of human language is a complex physiological and physical process between the human language center and the vocal organs.
  • the vocal organs used in speech (tongue, teeth, larynx, lungs, nasal cavity) differ greatly in size and shape from person to person, so the voiceprints of any two people are different.
  • a speech signal is an analog signal carrying specific information, and it originates from the sound signal emitted by a person and converted into a speech signal.
  • since every person's voiceprint is different, the speech signals converted from the same words spoken by different people are also different.
  • the speech acoustic feature is the voiceprint information contained in the sound emitted by each person.
  • Framing refers to dividing a continuous speech signal into multiple segments. At the rate of speech of a normal speech, the duration of the phoneme is about 50 to 200 milliseconds, so the frame length is generally taken to be less than 50 milliseconds. Microscopically, it must include enough vibration cycles.
  • as for the fundamental frequency of speech, a male voice is around 100 Hz and a female voice around 200 Hz, corresponding to periods of 10 ms and 5 ms. A frame should generally contain multiple periods, so it is usually at least 20 milliseconds long.
  • the so-called speech signal includes a continuous speech, such as a sentence, a paragraph, and the like.
  • the speech acoustic feature may be a Mel Frequency Cepstral Coefficient (MFCC) of the speech segment, or a Perceptual Linear Prediction Coefficient (PLP), or a Filter Bank Feature, or the like.
  • the speech acoustic feature may also be the original speech data of the speech segment.
  • the extraction module 1 extracts the speech acoustic features from the voice signal of the target user, that is, it extracts the speech of the person for whom the voiceprint model is to be established; speech signals produced by non-target users are not extracted.
  • a speech acoustic feature is extracted from a continuous speech signal and contains the spoken portion of that signal, so it is itself a continuous speech signal.
  • after the extraction module 1 frames the speech signal, multiple speech segments are obtained, and the speech acoustic features of each segment are extracted separately, yielding a plurality of speech acoustic features.
  • each speech acoustic feature is extracted from a framed speech signal and is itself a segment of speech signal; the cluster structure module 2 inputs these segments into the neural-network-trained model in order to aggregate the speech acoustic features, which makes it convenient to collect statistics on and compute with them.
  • a cluster structure produced by the cluster structure module 2 is an aggregation of the speech acoustic features of the segments, and it reflects the common characteristics shared by the aggregated speech acoustic features.
  • after the calculation module 3 inputs the plurality of speech acoustic features into the neural-network-based deep learning model, at least one cluster structure x1, x2, ..., xn is output; assuming each cluster structure is a p-dimensional vector, xi = (xi1, xi2, ..., xip)^T (i = 1, 2, ..., n). The calculation module 3 computes the mean of each component and assembles the p-dimensional mean vector to obtain the average value E(x), and computes the standard deviation as D(x) = E{[x − E(x)][x − E(x)]^T}.
  • the feature vector module 4 passes the above E(x) and D(x) through the a-level mapping and the b-level mapping.
  • the a-level mapping is a coordinate transformation of the average value and the standard deviation of the cluster structures;
  • the b-level mapping passes the average value and the standard deviation of the cluster structures through an activation function to obtain a nonlinear result;
  • this result is the feature vector parameter used to establish the voiceprint model.
  • the model module 5 inputs the feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model of the target user; after receiving a voice signal, the voiceprint model determines whether the person who produced the voice signal is the target user.
  • the base model refers to a neural network model, such as a BP neural network model.
  • a BP neural network is a multi-layer network that trains the weights of nonlinear differentiable functions. Its most notable characteristic is that, using only sample data and without a mathematical model of the system, it can realize a highly nonlinear mapping from the p^m space formed by the pattern vectors p of the m input neurons to the y^n space, where n is the number of output nodes.
  • the activation function of the b-level map can be Sigmoid.
  • the Sigmoid function is an S-shaped function common in biology, also known as the S-shaped growth curve. It is mainly used as the threshold function of a neural network and, in a physical sense, is closest to a biological neuron. Its nonlinear activation form is σ(x) = 1/(1 + e^(-x)), where x is the input speech acoustic feature and e is the natural constant, approximately 2.71828.
  • the deep learning model includes a multi-layer model layer
  • the cluster structure module 2 includes:
  • an input unit 21 configured to input a plurality of the voice acoustic features into a deep learning model
  • the establishing unit 22 is configured to select any time node t among the plurality of speech acoustic features and establish an n-th model layer from the speech acoustic features within every tn interval from the time node t, where n is a positive integer;
  • the selecting unit 23 is configured to select a target model layer in the multi-layer model layer, and acquire at least one cluster structure generated on the target model layer.
  • a plurality of speech acoustic features are extracted from a continuous speech signal, and thus a plurality of speech acoustic features are also continuous.
  • the input unit 21 inputs a plurality of the speech acoustic characteristics into the deep learning model, it is also input in chronological order.
  • the plurality of speech acoustic features are each a continuous sound signal, and together they are also a continuous sound signal.
  • the establishing unit 22 selects any time node t and aggregates the speech acoustic features within a time span tn from t to form cluster structures on the corresponding model layer.
  • for example, the establishing unit 22 establishes a first model layer with an interval of t1 (1 ms), so the first model layer has 10,000 frames in total. Then the establishing unit 22 establishes a second model layer, taking t2 as 2 ms and establishing a frame every 2 ms, so the second model layer has 5,000 frames.
  • the model module 5 includes:
  • a dimension reduction unit 51 configured to perform dimension reduction on a feature vector parameter of the voiceprint model
  • the model unit 52 is configured to input the reduced-dimensional feature vector parameter into a preset basic model to obtain a voiceprint model.
  • the dimension reduction unit 51 performs dimensionality reduction using a linear discriminant analysis (LDA) based on probability.
  • the model unit 52 then performs a model design of the voiceprint of the target user.
  • the output layer adopts the Softmax function to compute its result. All nodes are initialized with uniform random weights in the interval [-0.05, 0.05], the initial bias is 0, and the final voiceprint model is obtained.
  • the input to the softmax function is a vector, and its output is also a vector. Each element in the vector is a probability value between 0 and 1.
  • when training the model, the bias reflects the degree to which the labels predicted on the training set in each round of training deviate from the original true labels; if this deviation is too small, over-fitting occurs, because the noise in the training set may also have been learned. The bias therefore characterizes the fitting ability of the learning algorithm itself: if the fitting ability is poor, the bias is large and under-fitting occurs; conversely, if the fitting ability is too good, the bias is small and over-fitting easily occurs. During training it can be observed that this bias should gradually decrease, indicating that the model is continually learning useful information.
  • the extraction module 1 includes:
  • the calculating unit 11 is configured to perform fast Fourier transform calculation on the framed speech signal to obtain an energy spectrum
  • an input unit 12 configured to input the energy spectrum into a Mel-scale triangular filter bank and output formant features;
  • the transforming unit 13 is configured to perform discrete cosine transform on the formant feature to obtain a speech acoustic feature.
  • the calculating unit 11 performs fast Fourier transform on the effective speech signal extracted after the framing, and converts the time domain speech signal into the energy spectrum in the frequency domain.
  • Fast Fourier Transform is a fast algorithm of discrete Fourier transform. It is based on the odd, even, imaginary and real properties of discrete Fourier transform, and the algorithm of discrete Fourier transform is improved.
  • the formant is an important feature that reflects the resonance characteristics of the channel. It represents the most direct source of pronunciation information, and humans use formant information in speech perception. Therefore, the formant is a very important characteristic parameter in speech signal processing, and has been widely used as the main feature of speech recognition and basic information of speech coding transmission.
  • the formant information is included in the frequency envelope, so the key to the formant parameter extraction is to estimate the natural speech spectral envelope. It is generally considered that the maximum value in the spectral envelope is the formant.
  • the input unit 12 then inputs the energy spectrum into the Mel-scale triangular filter bank to calculate the logarithmic energy output by each filter.
  • the features output by the filter bank are also referred to as Filter Bank (FBANK) features. Filtering with the Mel-scale filter bank is used because the frequency-domain signal contains a lot of redundancy;
  • the filter bank condenses the frequency-domain amplitudes so that each frequency band is represented by a single value. Concretely, the spectrum obtained by the fast Fourier transform is multiplied and accumulated with each filter,
  • and the resulting value is the energy of that frame in the frequency band of that filter.
  • in the transform unit 13, since the human ear's perception of sound is not linear, it is better described by the nonlinear log relationship, and cepstral analysis can only be performed after taking the logarithm; the energy values are therefore converted to logarithmic energy.
  • the logarithmic energy is then passed through a discrete cosine transform to finally obtain the MFCC coefficients (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features.
  • the foregoing apparatus for establishing a voiceprint model further includes:
  • the verification module 6 is configured to input the voice signal to be verified into the voiceprint model to obtain an identity verification result output by the voiceprint model.
  • after the voiceprint model is established, it has a port for receiving voice signals.
  • after receiving a voice signal, the voiceprint model evaluates it: if it is the target user's voice signal, the verification module 6 outputs a target-correct signal; if it is not the target user's voice signal, the verification module 6 outputs a target-error signal.
  • the device for establishing a voiceprint model further includes:
  • the attribute module 7 is configured to receive attribute information marked by the user for the voiceprint model, where the attribute information includes the gender, age, and ethnicity of the target user.
  • the attribute module 7 receives the mark added by the user to the voiceprint model, and marks the personal information of the target user corresponding to the voiceprint model, including gender, age, ethnicity, height, Weight and so on.
  • the voiceprint information is related to the vocal organ
  • the vocal control organ includes vocal cords, soft palate, tongue, teeth, lips, etc.
  • the vocal resonator includes the pharyngeal cavity, the oral cavity, and the nasal cavity. People who are similar in vocal organs have a certain commonality or close proximity. Therefore, the voiceprint information of people with the same attribute information will be similar. After collecting the voiceprint information of many people, summarize them to find out the relationship between the voiceprint information and people.
  • the extracting module 1 further includes:
  • the identifying unit 14 is configured to identify the voice content of the input framed speech signal;
  • the determining unit 15 is configured to determine the articulation part of the voice content;
  • the splitting unit 16 is configured to split the voice signal according to the sounding part
  • the extracting unit 17 is configured to separately extract a speech acoustic feature from the split speech signal.
  • the recognition unit 14 recognizes the voice content of the input voice signal, that is, the voice signal is recognized by means of voice recognition, and the specific voice text information of the speaker is recognized.
  • the determining unit 15 determines that the utterance part of the voice content is based on the voice content recognized by the recognition unit 14, reads the pinyin or the phonetic symbol of the voice content, and determines the utterance part according to the content of the pinyin or the phonetic symbol.
  • Commonly used main sounding parts are throat, tongue, nose, teeth and so on. For example, in Mandarin, the corresponding vocal part is determined according to different initials.
  • the table corresponding to the specific initial and the sounding part is as follows:
  • after the determining unit 15 determines the articulation part of the voice content, the splitting unit 16 traces back the articulation parts of the speech signal and splits the speech signal into multiple segments according to the articulation part corresponding to each part of the signal, so that each segment of the speech signal corresponds to one articulation part.
  • for example, in a speech signal with a duration of 10 seconds, the voice content of seconds 0-2 contains initials of b, p, or m,
  • the voice content of seconds 3-5 contains initials of j, q, or x,
  • and the voice content of seconds 6-10 contains initials of d, t, n, or l; the splitting unit 16 then splits the speech signal into three speech signals.
  • the first segment is the speech content of 0-2 seconds
  • the second segment is the speech content of the 3-5th second
  • the third segment is the speech content of the 6-10th second.
  • the extracting unit 17 then extracts acoustic features for the three pieces of speech content, respectively, and then inputs them into the subsequent deep learning model.
  • in summary, in the apparatus for establishing a voiceprint model of the present application, the extracted speech acoustic features are aggregated into cluster structures by deep-neural-network-based training, and the cluster structures then undergo coordinate mapping and activation function calculation.
  • the resulting voiceprint model can reduce the voice recognition error rate of the voiceprint model.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 15.
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the memory provides an environment for the operation of operating systems and computer readable instructions in a non-volatile storage medium.
  • the database of the computer device is used to store data such as a voiceprint model.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the computer readable instructions when executed, perform the flow of an embodiment of the methods described above. It will be understood by those skilled in the art that the structure shown in FIG. 15 is only a block diagram of a part of the structure related to the present application, and does not constitute a limitation of the computer device to which the present application is applied.
  • An embodiment of the present application further provides a computer non-volatile readable storage medium having stored thereon computer readable instructions that, when executed, perform the processes of the embodiments of the methods described above.
  • the above description covers only preferred embodiments of the present application and is not intended to limit the scope of the patent; any equivalent structural or process changes made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Abstract

A method, apparatus, computer device and storage medium for establishing a voiceprint model. The method includes: aggregating speech acoustic features of a voice signal into a plurality of cluster structures; computing the average value and standard deviation of the cluster structures and then performing coordinate transformation and activation function calculation to obtain feature vector parameters; and obtaining a voiceprint model from the feature vector parameters. The voiceprint model can reduce the voice recognition error rate of the voiceprint model.

Description

Method, Apparatus, Computer Device and Storage Medium for Establishing a Voiceprint Model
[0001] This application claims priority to Chinese Patent Application No. 201810433792X, entitled "Method, Apparatus, Computer Device and Storage Medium for Establishing a Voiceprint Model", filed with the Chinese Patent Office on May 8, 2018, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
[0002] The present application relates to the field of computer technology, and in particular to a method, apparatus, computer device and storage medium for establishing a voiceprint model.
BACKGROUND
[0003] A voiceprint is the spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. Modern research shows that voiceprints are not only specific to the speaker but also relatively stable: after adulthood, a person's voice remains relatively stable for a long time. A voiceprint recognition algorithm builds a recognition model by learning various voice features extracted from the sound spectrum, and thereby confirms the speaker. Current voiceprint recognition methods work well for long speech texts (speaker audio longer than 1 minute), but for short speech texts (speaker audio shorter than 1 minute, for example around 20 s) the recognition error rate is still relatively high.
[0004] Therefore, how to establish a voiceprint model that can reduce the error rate of short speech text recognition is a problem that urgently needs to be solved.
SUMMARY OF THE INVENTION
Technical Problem
[0005] The main object of the present application is to provide a method, apparatus, computer device and storage medium for establishing a voiceprint model that reduces the recognition error rate for short speech texts.
Solution to Problem
Technical Solution
[0006] In order to achieve the above object, the present application provides a method for establishing a voiceprint model, including:
[0007] framing a voice signal of an input target user, and separately extracting speech acoustic features of the framed voice signal;
[0008] inputting a plurality of the speech acoustic features into a deep learning model based on neural network training, and aggregating them into at least one cluster structure;
[0009] calculating an average value and a standard deviation of at least one of the cluster structures;
[0010] performing coordinate transformation and activation function calculation on the average value and the standard deviation to obtain a feature vector parameter;
[0011] inputting the feature vector parameter and the identity verification result of the target user into a preset basic model to obtain a voiceprint model corresponding to the target user, the voiceprint model being used to verify whether an input voice signal belongs to the target user.
[0012] The present application also provides an apparatus for establishing a voiceprint model, including:
[0013] an extraction module, configured to frame a voice signal of an input target user and separately extract speech acoustic features of the framed voice signal;
[0014] a cluster structure module, configured to input a plurality of the speech acoustic features into a deep learning model based on neural network training and aggregate them into at least one cluster structure;
[0015] a calculation module, configured to calculate an average value and a standard deviation of at least one of the cluster structures;
[0016] a feature vector module, configured to perform coordinate transformation and activation function calculation on the average value and the standard deviation to obtain a feature vector parameter;
[0017] a model module, configured to input the feature vector parameter and the identity verification result of the target user into a preset basic model to obtain a voiceprint model corresponding to the target user, the voiceprint model being used to verify whether an input voice signal belongs to the target user.
[0018] The present application also provides a computer device, including a memory and a processor, the memory storing computer readable instructions, and the processor implementing the steps of any of the above methods when executing the computer readable instructions.
[0019] The present application also provides a computer non-volatile readable storage medium on which computer readable instructions are stored, the computer readable instructions implementing the steps of any of the above methods when executed by a processor.
Advantageous Effects of Invention
Advantageous Effects
[0020] With the method, apparatus, computer device and storage medium for establishing a voiceprint model of the present application, the extracted speech acoustic features are aggregated into cluster structures by deep-neural-network-based training, and the cluster structures then undergo coordinate mapping and activation function calculation; the resulting voiceprint model can reduce the voice recognition error rate of the voiceprint model.
BRIEF DESCRIPTION OF DRAWINGS
[0021] FIG. 1 is a schematic flowchart of a method for establishing a voiceprint model according to an embodiment of the present application;
[0022] FIG. 2 is a schematic flowchart of step S2 of the method for establishing a voiceprint model according to an embodiment of the present application;
[0023] FIG. 3 is a schematic flowchart of step S22 of the method for establishing a voiceprint model according to an embodiment of the present application;
[0024] FIG. 4 is a schematic flowchart of step S5 of the method for establishing a voiceprint model according to an embodiment of the present application;
[0025] FIG. 5 is a schematic flowchart of step S1 of the method for establishing a voiceprint model according to an embodiment of the present application;
[0026] FIG. 6 is a schematic flowchart of step S11 of the method for establishing a voiceprint model according to an embodiment of the present application;
[0027] FIG. 7 is a schematic flowchart of a method for establishing a voiceprint model according to an embodiment of the present application;
[0028] FIG. 8 is a schematic flowchart of step S1 of the method for establishing a voiceprint model according to an embodiment of the present application;
[0029] FIG. 9 is a schematic structural diagram of an apparatus for establishing a voiceprint model according to an embodiment of the present application;
[0030] FIG. 10 is a schematic structural diagram of a cluster structure module of the apparatus for establishing a voiceprint model according to an embodiment of the present application;
[0031] FIG. 11 is a schematic structural diagram of a model module of the apparatus for establishing a voiceprint model according to an embodiment of the present application;
[0032] FIG. 12 is a schematic structural diagram of an extraction module of the apparatus for establishing a voiceprint model according to an embodiment of the present application;
[0033] FIG. 13 is a schematic structural diagram of an apparatus for establishing a voiceprint model according to an embodiment of the present application;
[0034] FIG. 14 is a schematic structural diagram of an extraction module of the apparatus for establishing a voiceprint model according to an embodiment of the present application;
[0035] FIG. 15 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
BEST MODE FOR CARRYING OUT THE INVENTION
Best Mode of the Invention
[0036] Referring to FIG. 1, an embodiment of the present application provides a method for establishing a voiceprint model, including the steps of:
[0037] S1: framing a voice signal of an input target user, and separately extracting speech acoustic features of the framed voice signal;
[0038] S2: inputting a plurality of the speech acoustic features into a deep learning model based on neural network training, and aggregating them into at least one cluster structure;
[0039] S3: calculating an average value and a standard deviation of at least one of the cluster structures;
[0040] S4: performing coordinate transformation and activation function calculation on the average value and the standard deviation to obtain a feature vector parameter;
[0041] S5: inputting the feature vector parameter and the identity verification result of the target user into a preset basic model to obtain a voiceprint model corresponding to the target user, the voiceprint model being used to verify whether an input voice signal belongs to the target user.
[0042] As described in step S1 above, a voiceprint is the spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. The production of human language is a complex physiological and physical process between the human language centre and the vocal organs. The vocal organs used in speech (tongue, teeth, larynx, lungs, nasal cavity) differ greatly in size and shape from person to person, so the voiceprints of any two people are different. A speech signal is an analog signal carrying specific information, and it originates from the sound signal emitted by a person and converted into a speech signal. Every person's voiceprint is different, so the speech signals converted from the same words spoken by different people are also different, and hence the speech acoustic features contained in the speech signals are also different. Speech acoustic features are the voiceprint information contained in the sound produced by each person. Framing refers to dividing a continuous speech signal into multiple segments. At a normal speaking rate, the duration of a phoneme is about 50-200 milliseconds, so the frame length is generally taken to be less than 50 milliseconds. Microscopically, a frame must also include enough vibration periods: the fundamental frequency of a male voice is around 100 Hz and that of a female voice around 200 Hz, corresponding to periods of 10 ms and 5 ms; a frame should generally contain multiple periods, so it is usually at least 20 milliseconds long. The speech signal here includes a segment of continuous speech, such as a sentence or a paragraph. The speech acoustic feature may be the Mel-frequency cepstral coefficients (MFCC) of the speech segment, or the perceptual linear prediction coefficients (PLP), or filter bank features (Filter Bank Feature), etc. Of course, the speech acoustic feature may also be the original speech data of the speech segment. Extracting the speech acoustic features from the voice signal of the target user means extracting the speech of the person for whom the voiceprint model is to be established; speech signals produced by non-target users are not extracted. A speech acoustic feature is extracted from a continuous speech signal and contains the spoken portion of that signal, so it is itself a continuous speech signal. After the speech signal is framed, multiple speech segments are obtained, and the speech acoustic features of each segment are extracted separately, yielding a plurality of speech acoustic features.
[0043] As described in step S2 above, a speech acoustic feature is extracted from a framed speech signal and is itself a segment of speech signal; the speech signal is input into the neural network training model in order to aggregate the speech acoustic features, which makes it convenient to collect statistics on and compute with them. A cluster structure is the result of aggregating one or more speech acoustic features, and it reflects the common characteristics shared by the aggregated speech acoustic features.
[0044] As described in step S3 above, after the plurality of speech acoustic features are input into the neural-network-based deep learning model, at least one cluster structure x1, x2, x3, ..., xn is output; assuming each cluster structure is a p-dimensional vector, xi = (xi1, xi2, ..., xip)^T (i = 1, 2, ..., n). The mean and standard deviation of these cluster structures are computed: first the mean of each component is computed according to the formula x̄j = (1/n) Σᵢ xij, then the p-dimensional mean vector x0 = (x̄1, x̄2, ..., x̄p)^T is formed, giving the average value of the cluster structures, E(x). The standard deviation of the cluster structures is computed as D(x) = E{[x − E(x)][x − E(x)]^T}.
[0045] As described in step S4 above, the above E(x) and D(x) are passed through an a-level mapping and a b-level mapping. The a-level mapping is a coordinate transformation of the average value and standard deviation of the cluster structures; the b-level mapping passes the average value and standard deviation of the cluster structures through an activation function to obtain a nonlinear result, and this result is the feature vector parameter for establishing the voiceprint model.
[0046] As described in step S5 above, the system inputs the feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model of the target user; after receiving a voice signal, the voiceprint model determines whether the person who produced the voice signal is the target user. The basic model refers to a neural network model, such as a BP neural network model. A BP neural network is a multi-layer network that trains the weights of nonlinear differentiable functions. Its most notable characteristic is that, using only sample data and without a mathematical model of the system, it can realize a highly nonlinear mapping from the p^m space formed by the pattern vectors p of the m input neurons to the y^n space, where n is the number of output nodes. The a-level mapping and the b-level mapping above may be performed in either order. The activation function of the b-level mapping may be the Sigmoid function, an S-shaped function common in biology, also known as the S-shaped growth curve; it is mainly used as the threshold function of a neural network and, in a physical sense, is closest to a biological neuron. Its nonlinear activation form is σ(x) = 1/(1 + e^(-x)), where x is the input speech acoustic feature and e is the natural constant, approximately 2.71828.
[0047] Referring to FIG. 2, in this embodiment, the deep learning model includes multiple model layers, and the step of inputting a plurality of the speech acoustic features into the deep learning model based on neural network training and aggregating them into at least one cluster structure includes:
[0048] S21: inputting a plurality of the speech acoustic features into the deep learning model;
[0049] S22: selecting any time node t among the plurality of speech acoustic features, and establishing an n-th model layer from the speech acoustic features within every tn interval from the time node t, where n is a positive integer;
[0050] S23: selecting a target model layer among the multiple model layers, and acquiring at least one cluster structure generated on the target model layer.
[0051] As described in step S21 above, the plurality of speech acoustic features are all extracted from a continuous speech signal, so they are also continuous; when the plurality of speech acoustic features are input into the deep learning model, they are input in chronological order.
[0052] As described in step S22 above, the plurality of speech acoustic features are each a continuous sound signal, and together they also form a continuous sound signal. Within this signal, any time node t is selected, and the speech acoustic features within a time span tn from t are aggregated to form a cluster structure on one of the model layers. Since the deep learning model has multiple model layers, and the time node t and the time span tn chosen on each model layer differ, the number of cluster structures generated by each model layer is not exactly the same. For example, the plurality of speech acoustic features span a total of 10 seconds, i.e. 10,000 ms, and the selected time node is the 2000th ms; the first model layer is established with an interval of t1 (1 ms), so the first model layer has 10,000 frames. Then a second model layer is established, taking t2 as 2 ms and establishing a frame every 2 ms, so the second model layer has 5,000 frames.
[0053] As described in step S23 above, after learning through the deep learning model, multiple model layers are obtained, each containing multiple cluster structures. The system then selects one of the model layers as the target model layer, and the cluster structures on the target model layer serve as parameters for the subsequent generation of the voiceprint model.
[0054] Referring to FIG. 3, in a specific embodiment, five model layers are established, and step S22 includes the following steps:
[0055] S221: selecting any time node t among the plurality of speech acoustic features, and establishing a first model layer from the speech acoustic features within every t1 interval from the time node t;
[0056] S222: on the first model layer, establishing a second model layer from the speech acoustic features within every t2 interval from the time node t;
[0057] S223: on the second model layer, establishing a third model layer from the speech acoustic features within every t3 interval from the time node t;
[0058] S224: on the third model layer, establishing a fourth model layer from the speech acoustic features within every t4 interval from the time node t;
[0059] S225: on the fourth model layer, establishing a fifth model layer from the speech acoustic features within every t5 interval from the time node t, where t1 < t2 < t3 < t4 < t5.
[0060] As described in step S221 above, any time node t is selected; for example, the speech acoustic features span 10 seconds, i.e. 10,000 ms, the selected time node is the 2000th ms, and the first model layer is established with an interval of t1 (1 ms), so the first model layer has 10,000 frames.
[0061] In step S222, based on the first model layer, the selected time node is still the 2000th ms, and the second model layer is established every t2 (2 ms), so the second model layer has 5,000 frames. In step S223, based on the second model layer, the selected time node is still the 2000th ms, and the third model layer is established every t3 (3 ms), so the third model layer has 3,334 frames. In step S224, based on the third model layer, the selected time node is still the 2000th ms, and the fourth model layer is established every t4 (4 ms), so the fourth model layer has 2,500 frames. In step S225, based on the fourth model layer, the selected time node is still the 2000th ms, and the fifth model layer is established every t5 (8 ms), so the fifth model layer has 1,250 frames. Finally, the 1,250 frames on the fifth model layer are aggregated into cluster structures; after the five-layer deep learning model, 1,250 cluster structures are finally obtained.
[0062] Referring to FIG. 4, further, in this embodiment, the step of inputting the feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model corresponding to the target user includes:
[0063] S51: performing dimensionality reduction on the feature vector parameter of the voiceprint model;
[0064] S52: inputting the reduced-dimension feature vector parameter and the identity verification result of the target user into the preset basic model to obtain the voiceprint model.
[0065] In the above steps, the system uses probability-based linear discriminant analysis (LDA) to perform the dimensionality reduction, after which the model of the target user's voiceprint is designed. At the same time, the output layer adopts the Softmax function to compute its result, all nodes are initialized with uniform random weights in the interval [-0.05, 0.05], the initial bias is 0, and the final voiceprint model is obtained. The input of the softmax function is a vector, and its output is also a vector in which every element is a probability value between 0 and 1. When training the model, the bias reflects the degree to which the labels predicted on the training set in each round of training deviate from the original true labels; if this deviation is too small, over-fitting occurs, because the noise in the training set may also have been learned. The bias therefore characterizes the fitting ability of the learning algorithm itself: if the fitting ability is poor, the bias is large and under-fitting occurs; conversely, if the fitting ability is too good, the bias is small and over-fitting easily occurs. During training it can be observed that this bias should, in theory, gradually decrease, indicating that the model is continually learning useful information.
[0066] Referring to FIG. 5, in this embodiment, the step of extracting the speech acoustic features of the framed speech signal includes:
[0067] S11: performing fast Fourier transform calculation on the framed speech signal to obtain an energy spectrum;
[0068] S12: inputting the energy spectrum into a Mel-scale triangular filter bank and outputting formant features;
[0069] S13: performing a discrete cosine transform on the formant features to obtain the speech acoustic features.
[0070] In step S11 above, the valid speech signal extracted after framing is subjected to a fast Fourier transform, converting the time-domain speech signal into an energy spectrum in the frequency domain. The fast Fourier transform (FFT) is a fast algorithm for the discrete Fourier transform, obtained by improving the discrete Fourier transform algorithm based on its odd, even, imaginary, and real properties.
[0071] In step S12 above, the formant is an important feature reflecting the resonance characteristics of the vocal tract; it represents the most direct source of pronunciation information, and humans make use of formant information in speech perception. The formant is therefore a very important characteristic parameter in speech signal processing and has been widely used as a main feature for speech recognition and as basic information for speech coding and transmission. The formant information is contained in the frequency envelope, so the key to formant parameter extraction is to estimate the spectral envelope of natural speech; the maxima in the spectral envelope are generally considered to be the formants. The energy spectrum is then input into the Mel-scale triangular filters to calculate the logarithmic energy output by each filter; the features output by the filter bank are also called Filter Bank (FBANK) features. Filtering with the Mel-scale filter bank is used because the frequency-domain signal contains a lot of redundancy: the filter bank condenses the frequency-domain amplitudes so that each frequency band is represented by a single value. Concretely, the spectrum obtained by the fast Fourier transform is multiplied and accumulated with each filter, and the resulting value is the energy of that frame in the frequency band of that filter.
[0072] In step S13 above, after the formant features are converted to logarithmic energy, the MFCC coefficients (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features, are obtained by the discrete cosine transform. Since the human ear's perception of sound is not linear, it is better described by the nonlinear log relationship, and cepstral analysis can only be performed after taking the logarithm; the energy values are therefore converted to logarithmic energy. Because the result of the discrete cosine transform has no imaginary part and is easier to compute, the logarithmic energy is passed through a discrete cosine transform to finally obtain the MFCC coefficients, i.e. the MFCC acoustic features. [0073] Further, after the above step of obtaining the voiceprint model, the method includes:
[0074] S6: inputting a voice signal to be verified into the voiceprint model to obtain the identity verification result output by the voiceprint model.
[0075] As described in step S6 above, after the voiceprint model is established, it has a port for receiving voice signals. After receiving a voice signal, the voiceprint model evaluates it: if it is the target user's voice signal, a target-correct signal is output; if it is not the target user's voice signal, a target-error signal is output.
[0076] Referring to FIG. 6, further, the step of performing fast Fourier transform calculation on the framed speech signal includes:
[0077] S111: pre-emphasizing the framed speech signal;
[0078] S112: windowing the pre-emphasized speech signal;
[0079] S113: extracting, by voice endpoint detection, the valid part of the speech signal that contains speech;
[0080] S114: performing fast Fourier transform calculation on the valid part of the speech signal.
[0081] In step S111 above, the speech signal is pre-emphasized because it also contains some background noise; if the speech signal were used directly for voiceprint modeling, the noise would affect the result, the established model would be inaccurate, and the recognition error rate would increase. Extracting the valid speech directly is achieved by voice endpoint detection, that is, identifying at which moment in the recording the person starts speaking and at which moment the person stops speaking. The main principle of voice endpoint detection is that the spectrum of audio containing human speech is higher than the spectrum of audio that does not contain human speech; therefore, before the valid speech is extracted, the speech signal is pre-emphasized, i.e. amplified, so that the spectrum of the part containing human speech is higher and the difference between the two is more obvious, which makes voice endpoint detection more reliable.
[0082] In step S112 above, a goal that speech signal processing often needs to achieve is to determine the distribution of the frequency components in the speech. The mathematical tool for doing this is the Fourier transform, which requires the input signal to be stationary. Speech is not stationary at the macroscopic level, but microscopically a short segment of the speech signal can be regarded as stationary and can be cut out for the Fourier transform. The purpose of windowing is to let the amplitude of a frame of the signal taper to 0 at both ends; tapering to 0 is good for the Fourier transform, as it improves the resolution of the transform result (i.e. the spectrum). [0083] In step S113 above, since the speech signal also contains some background noise, directly performing voiceprint modeling on it would introduce the influence of that noise, the established model would be inaccurate, and the recognition error rate would increase. Extracting the valid speech directly is achieved by voice endpoint detection, that is, identifying at which moment in the recording the person starts speaking and at which moment the person stops speaking. Through endpoint detection, speech is distinguished from noise and the valid speech part is extracted. People also pause when speaking; extracting the valid part of the speech means removing the noise during the pauses and extracting only the valid speech of the spoken parts.
[0084] In step S114 above, the fast Fourier transform (FFT) is a fast algorithm for the discrete Fourier transform, obtained by improving the discrete Fourier transform algorithm based on its odd, even, imaginary, and real properties. In this way, the speech acoustic features of the speaker in a segment of speech can be calculated.
[0085] 参照图 7 , 进一步地, 所述得到声纹模型的步骤之后包括:
[0086] S7、 接收用户对所述声纹模型标记的属性信息, 所述属性信息包括所述目标用 户的性别、 年龄、 民族。
[0087] 在上述 S7步骤中, 将声纹模型建立后, 系统接收用户对声纹模型添加的标记, 标记该声纹模型对应的目标用户的个人信息, 包括性别、 年龄、 民族、 身高、 体重等。 因为声纹信息与发声的器官有关, 发声控制器官包括声带、 软颚、 舌头、 牙齿、 唇等; 发声共鸣器包括咽腔、 口腔、 鼻腔。 发声器官相近的人, 发出的声音具有一定的共性或比较接近, 因此, 属性信息相同的人的声纹信息会比较相近。 收集多个人的声纹信息后, 将其进行归纳总结, 便于找出声纹信息与人的属性之间的一些关系。
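作为示意, 可用如下数据结构保存声纹模型所标记的属性信息, 其中字段名称与取值均为举例假设:

from dataclasses import dataclass, field

@dataclass
class VoiceprintRecord:
    model_id: str
    gender: str = ""
    age: int = 0
    ethnicity: str = ""
    extra: dict = field(default_factory=dict)   # 身高、体重等其他属性

record = VoiceprintRecord(model_id="user_001", gender="男", age=30, ethnicity="汉族",
                          extra={"height_cm": 175, "weight_kg": 70})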
[0088] 参照图 8 , 进一步地, 本实施例中, 所述提取分帧后的语音信号的语音声学特 征的步骤包括:
[0089] S14、 识别输入的分帧后的语音信号的语音内容;
[0090] S15、 判断所述语音内容的发声部位;
[0091] S16、 根据所述发声部位将所述语音信号拆分;
[0092] S17、 分别对拆分后的语音信号提取语音声学特征。
[0093] 在上述步骤 S14中, 识别输入的分帧后的语音信号的语音内容, 即通过语音识别的手段, 将语音信号识别出来, 识别出说话人的具体说话文本信息。
[0094] 在上述步骤 S15中, 判断所述语音内容的发声部位, 是根据上述 S14中识别出的语音内容, 读取该语音内容的拼音或者音标, 根据拼音或者音标的内容来判断发声部位。 常用的主要发声部位有喉、 舌头、 鼻、 牙齿等。 例如在普通话中, 根据不同的声母确定对应的发声部位。 具体的声母与发声部位对应的表格如下:
[0095] [表 1]
(声母与发声部位的对应表在原文中以图片形式给出, 此处未复现)
[0096] 在上述步骤 S16中, 判断语音内容的发声部位后, 上溯回查语音信号的发声部位, 根据语音信号对应的发声部位, 将语音信号拆分成多段, 每一段语音信号都对应一个发声部位。 例如, 一段时长为 10秒的语音信号, 第 0-2秒的语音内容中均包含有 b或 p或 m的声母, 第 3-5秒的语音内容均包含有 j或 q或 x的声母, 第 6-10秒的语音内容均包含有 d或 t或 n或 l的声母, 那么, 将该语音信号拆分成三段语音信号。 第一段是第 0-2秒的语音内容, 第二段是第 3-5秒的语音内容, 第三段是第 6-10秒的语音内容。
[0097] 在上述步骤 S17中, 对这三段语音内容分别提取声学特征, 然后分别输入后面的深度学习模型中进行计算。
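下面给出按发声部位拆分语音信号的一个示意性 Python 片段。 其中声母与发声部位的对应关系仅列出上述示例中出现的几组 (属于举例假设, 完整对应关系以表 1为准), 各片段的时间戳假设由语音识别结果给出:

import numpy as np

INITIAL_TO_PLACE = {'b': '唇', 'p': '唇', 'm': '唇',
                    'j': '舌面', 'q': '舌面', 'x': '舌面',
                    'd': '舌尖', 't': '舌尖', 'n': '舌尖', 'l': '舌尖'}

def split_by_place(signal, sr, segments):
    # segments: [(起始秒, 结束秒, 声母), ...], 将发声部位相同且相邻的片段合并后切分信号
    pieces, cur_place, cur_start, cur_end = [], None, None, None
    for start, end, initial in segments:
        place = INITIAL_TO_PLACE.get(initial)
        if place == cur_place:
            cur_end = end
        else:
            if cur_place is not None:
                pieces.append(signal[int(cur_start * sr):int(cur_end * sr)])
            cur_place, cur_start, cur_end = place, start, end
    if cur_place is not None:
        pieces.append(signal[int(cur_start * sr):int(cur_end * sr)])
    return pieces   # 每段语音再分别提取语音声学特征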
[0098] 综上所述, 本申请的建立声纹模型的方法, 将提取出的语音声学特征基于深度 神经网络训练中得出簇结构, 然后将簇结构进行坐标映射和激活函数计算, 得 出的声纹模型, 可以降低声纹模型的声音识别错误率。
[0099] 参照图 9, 本申请还提出一种建立声纹模型的装置, 包括:
提取模块 1, 用于对输入的目标用户的语音信号分帧, 分别提取分帧后的语音信号的语音声学特征;
[0101] 簇结构模块 2, 用于将多个所述语音声学特征输入基于神经网络训练的深度学习模型中, 集合成至少一个簇结构;
[0102] 计算模块 3, 用于计算至少一个所述簇结构的平均值和标准差;
[0103] 特征向量模块 4, 用于将所述平均值和标准差进行坐标变换以及激活函数计算 , 得到特征向量参数;
[0104] 模型模块 5 , 用于将所述特征向量参数以及所述目标用户的身份验证结果输入 预设的基础模型, 得到与所述目标用户对应的声纹模型, 所述声纹模型用于验 证输入的语音信号是否为所述目标用户的。
[0105] 本实施例中, 提取模块 1中的声纹是用电声学仪器显示的携带言语信息的声波频谱。 人类语言的产生是人体语言中枢与发音器官之间一个复杂的生理物理过程, 人在讲话时使用的发声器官 (舌、 牙齿、 喉头、 肺、 鼻腔)在尺寸和形态方面每个人的差异很大, 所以任何两个人的声纹都有差异。 语音信号是一种搭载着特定信息的模拟信号, 其来源是由人发出的声音信号转换成的语音信号。 每个人的声纹不一样, 因而, 同样的话由不同的人说出, 产生的声音转换成的语音信号也是不一样的, 语音信号里所包含的语音声学特征也是不一样的。 语音声学特征是每个人发出的声音中包含的声纹信息。 分帧是指将连续的语音信号分成多段。 人在正常讲话的语速下, 音素的持续时间大约是 50~200毫秒, 所以帧长一般取为小于 50毫秒; 从微观上来看, 一帧又必须包括足够多的振动周期, 语音的基频, 男声在 100赫兹左右, 女声在 200赫兹左右, 换算成周期就是 10毫秒和 5毫秒, 一般一帧要包含多个周期, 所以帧长一般取至少 20毫秒。 所谓的语音信号包括一段连续的语音, 例如一个句子、 一段话等。 所述语音声学特征可为所述语音片段的梅尔频率倒谱系数 (MFCC), 或感知线性预测系数 (PLP), 或滤波器组特征 (Filter Bank Feature)等。 当然, 所述语音声学特征也可为所述语音片段的原始语音数据。 提取模块 1将目标用户的语音信号中的语音声学特征提取出来, 是将需要建立声纹模型的人说话的声音信号提取出来, 非目标用户说话产生的语音信号则不进行提取。 语音声学特征是从一段连续的语音信号中提取出来的包含有人说话部分的语音信号, 因而也是一段连续的语音信号。 提取模块 1将语音信号分帧后, 得到多段语音信号, 分别提取出每段语音信号的语音声学特征, 则得到多个语音声学特征。
[0106] 语音声学特征是从分帧后的语音信号中提取出来的, 本身也是一段语音信号。 簇结构模块 2将该语音信号输入到神经网络训练模型中, 目的是将语音声学特征进行集合计算, 方便统计与计算语音声学特征。 簇结构是该段语音声学特征的集合, 能体现出多个语音声学特征集合在一起的相同的共性特征。
[0107] 计算模块 3将多个语音声学特征输入基于神经网络的深度学习模型后, 输出得到至少一个簇结构 x1、 x2、 …、 xn, 假设每个簇结构是一个 p维向量, 则 xi=(xi1, xi2, …, xip)^T (i=1,2,…,n)。 计算这些簇结构的均值和标准差, 得到簇结构的平均值和标准差。 其中, 计算多个簇结构的平均值的方法为: 首先计算模块 3对每个分量 j 根据公式 x0j=(x1j+x2j+…+xnj)/n (j=1,2,…,p) 计算平均值, 然后计算模块 3将各分量的平均值组合成 p维的平均向量 x0=(x01, x02, …, x0p)^T, 即得到簇结构的平均值 E(x)。 计算模块 3计算多个簇结构的标准差的公式为: D(x)=E{[x-E(x)][x-E(x)]^T}。
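上述平均值与标准差的计算可用如下 numpy 片段示意 (仅为草图):

import numpy as np

def cluster_stats(clusters):
    # clusters: (n, p) 的矩阵, 每行是一个 p 维簇结构向量
    X = np.asarray(clusters)
    mean = X.mean(axis=0)                     # E(x): 各分量平均值组成的 p 维向量
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]  # D(x) = E{[x-E(x)][x-E(x)]^T}
    std = np.sqrt(np.diag(cov))               # 各分量的标准差
    return mean, std, cov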
[0108] 特征向量模块 4将上述的 E(x)和 D(x)经过 a级映射和 b级映射。 其中, a级映射是 将簇结构的平均值和标准差进行坐标变换, b级映射是将簇结构的平均值和标准 差通过激活函数计算后得出一个非线性结果, 该结果即为建立声纹模型的特征 向量参数。
[0109] 然后模型模块 5将特征向量参数以及目标用户的身份验证结果输入到预设的基础模型, 得到目标用户的声纹模型, 该声纹模型接收到语音信号后, 判断产生该语音信号的人是否是目标用户。 基础模型是指神经网络模型, 例如 BP神经网络模型。 BP神经网络是一种对非线性可微分函数进行权值训练的多层网络, 它的最大特点是仅仅借助样本数据, 无需建立系统的数学模型, 就可对系统实现由 m个输入神经元的模式向量组成的 m维输入空间到 n维输出空间 (n为输出节点数) 的高度非线性映射。 上述 a级映射和 b级映射, 两个映射的过程不分先后。 b级映射的激活函数可以采用 Sigmoid函数, Sigmoid函数是一个在生物学中常见的 S型函数, 也称为 S型生长曲线, 主要用作神经网络的阈值函数, 在物理意义上最为接近生物神经元, 其非线性激活函数的形式是 σ(x)=1/(1+e^(-x)), 该公式中, x是输入的语音声学特征, e是自然常数, 约为 2.71828。
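下面给出 a级映射 (坐标变换) 与 b级映射 (Sigmoid 激活) 的一个示意性片段。 其中变换矩阵与偏置在实际方案中应由训练得到, 此处以随机值代替, 属于假设:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # σ(x) = 1 / (1 + e^(-x))

def feature_vector_params(mean, std, A=None, b=None):
    # a 级映射: 将平均值与标准差拼接后做一次线性坐标变换;
    # b 级映射: 经 Sigmoid 激活得到非线性的特征向量参数
    stats = np.concatenate([mean, std])
    rng = np.random.default_rng(0)
    if A is None:
        A = rng.normal(scale=0.1, size=(len(stats), len(stats)))
    if b is None:
        b = np.zeros(len(stats))
    return sigmoid(stats @ A + b)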
[0110] 参照图 10, 本实施例中, 所述深度学习模型包括多层模型层, 所述簇结构模块 2包括:
[0111] 输入单元 21, 用于将多个所述语音声学特征输入深度学习模型中;
[0112] 建立单元 22, 用于选取多个所述语音声学特征中的任一时间节点 t, 以距离该 时间节点 t的每 tn时间内的语音声学特征建立第 n模型层, n为正整数;
[0113] 选择单元 23 , 用于选择所述多层模型层中的目标模型层, 并获取所述目标模型 层上生成的至少一个簇结构。
[0114] 本实施例中, 多个语音声学特征均是从一段连续的语音信号中提取出来的, 因 而多个语音声学特征也是连续的。 输入单元 21将多个所述语音声学特征输入到 深度学习模型中时, 也是按照时间顺序来输入的。
[0115] 多个语音声学特征均是一段连续的声音信号, 合在一起也是一段连续的声音信号。 在该多个语音声学特征中, 建立单元 22选取任一时间节点 t, 然后以距离 t时刻每 tn时间段内的语音声学特征集合, 在其中一个模型层上形成簇结构。 由于深度学习模型具有多层模型层, 每个模型层上选择的时间节点 t与距离 t时刻的时间段 tn不一样, 每层模型层生成的簇结构的数量是不完全相同的。 比如该多个语音声学特征一共有 10秒, 即 10000ms, 选择时间节点是第 2000ms, 建立单元 22每间隔 t1(1ms)建立一帧, 则第一模型层共有 10000帧。 然后建立单元 22建立第二模型层, 取 t2为 2ms, 每隔 t2(2ms)建立一帧, 则第二模型层共有 5000帧。
[0116] 经过深度学习模型学习后, 得到了多个模型层, 每个模型层中均有多个簇结构 , 选择单元 23再选择其中一个模型层上的簇结构作为后续生成声纹模型的参数
[0117] 参照图 11, 进一步地, 所述模型模块 5包括:
[0118] 降维单元 51, 用于将所述声纹模型的特征向量参数进行降维;
[0119] 模型单元 52, 用于将所述降维后的特征向量参数输入预设的基础模型, 得到声纹模型。
[0120] 本实施例中, 降维单元 51利用基于概率的线性判别分析 (Linear Discriminant Analysis, LDA) 进行降维, 之后模型单元 52进行目标用户声纹的模型设计。 同时, 输出层采用 Softmax函数计算结果, 所有节点均采用 [-0.05~0.05]区间的均匀随机权重初始化, 偏置初始为 0, 得到最终的声纹模型。 Softmax函数的输入是一个向量, 其输出也是一个向量, 输出向量中的每个元素都是介于 0和 1之间的概率值。 这里的偏置是指训练模型时, 每一次训练得到的训练集预测标签与原始真实标签的偏离程度: 如果此偏离程度过小, 则可能将训练集中的噪声也学习进去, 导致过拟合的发生。 也就是说, 偏置刻画了学习算法本身的拟合能力: 拟合能力不好时偏置较大, 出现欠拟合; 反之拟合能力过强时偏置较小, 容易出现过拟合。 训练时可以观察到该偏置理论上应当逐渐变小, 表明模型正在不断学习有用的信息。
[0121] 参照图 12, 进一步地, 所述提取模块 1包括:
[0122] 计算单元 11, 用于将分帧后的语音信号进行快速傅里叶变换计算, 得到能量谱
[0123] 输入单元 12, 用于将所述能量谱输入梅尔尺度的三角滤波器组, 输出共振峰特 征;
[0124] 变换单元 13 , 用于将所述共振峰特征经离散余弦变换, 得到语音声学特征。
[0125] 本实施例中, 计算单元 11将分帧后提取出的有效语音信号进行快速傅里叶变换, 将时域的语音信号转换成频域的能量谱。 快速傅里叶变换 (FFT) 是离散傅氏变换的快速算法, 它是根据离散傅氏变换的奇、 偶、 虚、 实等特性, 对离散傅立叶变换的算法进行改进获得的。 共振峰是反映声道谐振特性的重要特征, 它代表了发音信息的最直接的来源, 而且人在语音感知中利用了共振峰信息。 所以共振峰是语音信号处理中非常重要的特征参数, 已经广泛地用作语音识别的主要特征和语音编码传输的基本信息。 共振峰信息包含在频率包络之中, 因此共振峰参数提取的关键是估计自然语音频谱包络, 一般认为谱包络中的最大值就是共振峰。 之后输入单元 12将能量谱输入梅尔尺度的三角滤波器组, 计算每个滤波器输出的对数能量, 滤波器组输出的特征又称为 Filter Bank(FBANK)特征。 使用梅尔刻度滤波器组过滤, 是因为频域信号有很多冗余, 滤波器组可以对频域的幅值进行精简, 每一个频段用一个值来表示; 过滤的具体步骤是将快速傅里叶变换后得到的频谱分别跟每一个滤波器进行频率相乘累加, 得到的值即为该帧数据在该滤波器对应频段的能量值。 将共振峰特征经对数能量计算后, 经变换单元 13进行离散余弦变换就可得到 MFCC系数 (mel frequency cepstrum coefficient), 亦即 MFCC声学特征。 由于人耳对声音的感知并不是线性的, 用 log这种非线性关系能更好地描述, 且取完 log以后才可以进行倒谱分析, 因此先将能量值进行对数计算, 得到对数能量。 又因为离散余弦变换的结果没有虚部, 更便于计算, 因此将对数能量进行离散余弦变换, 最终得到 MFCC声学特征。
[0126] 进一步地, 上述建立声纹模型的装置还包括:
[0127] 验证模块 6 , 用于将待验证语音信号输入所述声纹模型中, 得到所述声纹模型 输出的身份验证结果。
[0128] 本实施例中, 建立好声纹模型后, 该声纹模型具有一个接收语音信号的端口。
验证模块 6接收到语音信号后, 该声纹模型将该语音信号进行计算, 若是目标用 户的语音信号, 则验证模块 6输出目标正确的信号; 若不是目标用户的语音信号 , 则验证模块 6输出目标错误的信号。
[0129] 参照图 13 , 进一步地, 所述建立声纹模型的装置还包括:
[0130] 属性模块 7, 用于接收用户对所述声纹模型标记的属性信息, 所述属性信息包 括所述目标用户的性别、 年龄、 民族。
[0131] 本实施例中, 将声纹模型建立后, 属性模块 7接收用户对声纹模型添加的标记, 标记该声纹模型对应的目标用户的个人信息, 包括性别、 年龄、 民族、 身高、 体重等。 因为声纹信息与发声的器官有关, 发声控制器官包括声带、 软颚、 舌头、 牙齿、 唇等; 发声共鸣器包括咽腔、 口腔、 鼻腔。 发声器官相近的人, 发出的声音具有一定的共性或比较接近, 因此, 属性信息相同的人的声纹信息会比较相近。 收集多个人的声纹信息后, 将其进行归纳总结, 便于找出声纹信息与人的属性之间的一些关系。
[0132] 参照图 14, 进一步地, 所述提取模块 1还包括:
[0133] 识别单元 14, 用于识别输入的分帧后的语音信号的语音内容;
[0134] 判断单元 15, 用于判断所述语音内容的发声部位;
[0135] 拆分单元 16 , 用于根据所述发声部位将所述语音信号拆分;
[0136] 提取单元 17 , 用于分别对拆分后的语音信号提取语音声学特征。
[0137] 本实施例中, 识别单元 14识别输入的语音信号的语音内容, 即通过语音识别的 手段, 将语音信号识别出来, 识别出说话人的具体说话文本信息。
[0138] 判断单元 15判断所述语音内容的发声部位, 是根据上述识别单元 14中识别出的 语音内容, 读取该语音内容的拼音或者是音标, 根据拼音或者音标的内容来判 断发声部位。 常用的主要发声部位有喉、 舌头、 鼻、 牙齿等。 例如在普通话中 , 根据不同的声母确定对应的发声部位。 具体的声母与发声部位对应的表格如 下:
[0139] [表 2]
(声母与发声部位的对应表在原文中以图片形式给出, 此处未复现)
[0140] 判断单元 15判断语音内容的发声部位后, 拆分单元 16上溯回查语音信号的发声部位, 然后拆分单元 16根据语音信号对应的发声部位, 将语音信号拆分成多段, 每一段语音信号都对应一个发声部位。 例如, 一段时长为 10秒的语音信号, 第 0-2秒的语音内容中均包含有 b或 p或 m的声母, 第 3-5秒的语音内容均包含有 j或 q或 x的声母, 第 6-10秒的语音内容均包含有 d或 t或 n或 l的声母, 那么, 拆分单元 16将该语音信号拆分成三段语音信号。 第一段是第 0-2秒的语音内容, 第二段是第 3-5秒的语音内容, 第三段是第 6-10秒的语音内容。 然后提取单元 17分别对这三段语音内容提取出声学特征, 再分别输入后面的深度学习模型中计算。
[0141] 综上所述, 本申请的建立声纹模型的装置, 将提取出的语音声学特征基于深度 神经网络训练中得出簇结构, 然后将簇结构进行坐标映射和激活函数计算, 得 出的声纹模型, 可以降低声纹模型的声音识别错误率。
[0142] 参照图 15, 本申请实施例中还提供一种计算机设备, 该计算机设备可以是服务器, 其内部结构可以如图 15所示。 该计算机设备包括通过系统总线连接的处理器、 存储器、 网络接口和数据库。 其中, 该计算机设备的处理器用于提供计算和控制能力。 该计算机设备的存储器包括非易失性存储介质、 内存储器。 该非易失性存储介质存储有操作系统、 计算机可读指令和数据库。 该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。 该计算机设备的数据库用于存储建立声纹模型等数据。 该计算机设备的网络接口用于与外部的终端通过网络连接通信。 该计算机可读指令在执行时, 执行如上述各方法的实施例的流程。 本领域技术人员可以理解, 图 15中示出的结构, 仅仅是与本申请方案相关的部分结构的框图, 并不构成对本申请方案所应用于其上的计算机设备的限定。
[0143] 本申请一实施例还提供一种计算机非易失性可读存储介质, 其上存储有计算机 可读指令, 该计算机可读指令在执行时, 执行如上述各方法的实施例的流程。 以上所述仅为本申请的优选实施例, 并非因此限制本申请的专利范围, 凡是利 用本申请说明书及附图内容所作的等效结构或等效流程变换, 或直接或间接运 用在其他相关的技术领域, 均同理包括在本申请的专利保护范围内。

Claims

权利要求书
[权利要求 1] 一种建立声纹模型的方法, 其特征在于, 包括:
对输入的目标用户的语音信号分帧, 分别提取分帧后的语音信号的语 音声学特征;
将多个所述语音声学特征输入基于神经网络训练的深度学习模型中, 集合成至少一个簇结构;
计算至少一个所述簇结构的平均值和标准差;
将所述平均值和标准差进行坐标变换以及激活函数计算, 得到特征向 量参数;
将所述特征向量参数以及所述目标用户的身份验证结果输入预设的基 础模型, 得到与所述目标用户对应的声纹模型, 所述声纹模型用于验 证输入的语音信号是否为所述目标用户的。
[权利要求 2] 如权利要求 1所述的建立声纹模型的方法, 其特征在于, 所述深度学习模型包括多层模型层, 所述将多个所述语音声学特征输入基于神经网络训练的深度学习模型中, 集合成至少一个簇结构的步骤包括: 将多个所述语音声学特征输入深度学习模型中; 选取多个所述语音声学特征中的任一时间节点 t, 以距离该时间节点 t的每 tn时间内的语音声学特征建立第 n模型层, n为正整数; 选择所述多层模型层中的目标模型层, 并获取所述目标模型层上生成的至少一个簇结构。
[权利要求 3] 如权利要求 1所述的建立声纹模型的方法, 其特征在于, 所述将所述 特征向量参数以及所述目标用户的身份验证结果输入预设的基础模型 , 得到与所述目标用户对应的声纹模型的步骤, 包括:
将所述声纹模型的特征向量参数进行降维;
将所述降维后的特征向量参数以及所述目标用户的身份验证结果输入 预设的基础模型, 得到声纹模型。
[权利要求 4] 如权利要求 1所述的建立声纹模型的方法, 其特征在于, 所述提取分 帧后的语音信号的语音声学特征的步骤包括: 将分帧后的语音信号进行快速傅里叶变换计算, 得到能量谱;
将所述能量谱输入梅尔尺度的三角滤波器组, 输出共振峰特征; 将所述共振峰特征经离散余弦变换, 得到语音声学特征。
[权利要求 5] 如权利要求 1所述的建立声纹模型的方法, 其特征在于, 所述得到声 纹模型的步骤之后包括:
将待验证语音信号输入所述声纹模型中, 得到所述声纹模型输出的身 份验证结果。
[权利要求 6] 如权利要求 1所述的建立声纹模型的方法, 其特征在于, 所述得到声 纹模型的步骤之后包括:
接收用户对所述声纹模型标记的属性信息, 所述属性信息包括所述目 标用户的性别、 年龄、 民族。
[权利要求 7] 一种建立声纹模型的装置, 其特征在于, 包括: 提取模块, 用于对输 入的目标用户的语音信号分帧, 分别提取分帧后的语音信号的语音声 学特征; 簇结构模块, 用于将多个所述语音声学特征输入基于神经网 络训练的深度学习模型中, 集合成至少一个簇结构; 计算模块, 用于 计算至少一个所述簇结构的平均值和标准差; 特征向量模块, 用于将 所述平均值和标准差进行坐标变换以及激活函数计算, 得到特征向量 参数; 模型模块, 用于将所述特征向量参数以及所述目标用户的身份 验证结果输入预设的基础模型, 得到与所述目标用户对应的声纹模型 , 所述声纹模型用于验证输入的语音信号是否为所述目标用户的。
[权利要求 8] 如权利要求 7所述的建立声纹模型的装置, 其特征在于, 所述深度学 习模型包括多层模型层, 所述簇结构模块包括: 输入单元, 用于将多 个所述语音声学特征输入深度学习模型中; 建立单元, 用于选取多个 所述语音声学特征中的任一时间节点 t, 以距离该时间节点 t的每 tn时 间内的语音声学特征建立第 n模型层, n为正整数; 选择单元, 用于选 择所述多层模型层中的目标模型层, 并获取所述目标模型层上生成的 至少一个簇结构。
[权利要求 9] 如权利要求 7所述的建立声纹模型的装置, 其特征在于, 所述模型模 块包括: 降维单元, 用于将所述声纹模型的特征向量参数进行降维; 模型单元, 用于将所述降维后的特征向量参数输入预设的基础模型, 得到声纹模型。
[权利要求 10] 如权利要求 7所述的建立声纹模型的装置, 其特征在于, 所述提取模 块包括: 计算单元, 用于将分帧后的语音信号进行快速傅里叶变换计 算, 得到能量谱; 输入单元, 用于将所述能量谱输入梅尔尺度的三角 滤波器组, 输出共振峰特征; 变换单元, 用于将所述共振峰特征经离 散余弦变换, 得到语音声学特征。
[权利要求 11] 如权利要求 7所述的建立声纹模型的装置, 其特征在于, 所述建立声 纹模型的装置还包括: 验证模块, 用于将待验证语音信号输入所述声 纹模型中, 得到所述声纹模型输出的身份验证结果。
[权利要求 12] 如权利要求 7所述的建立声纹模型的装置, 其特征在于, 所述建立声纹模型的装置还包括: 属性模块, 用于接收用户对所述声纹模型标记的属性信息, 所述属性信息包括所述目标用户的性别、 年龄、 民族。
[权利要求 13] 一种计算机设备, 包括存储器和处理器, 所述存储器存储有计算机可 读指令, 其特征在于, 所述处理器执行所述计算机可读指令时实现一 种建立声纹模型的方法, 该建立声纹模型的方法, 包括: 对输入的目 标用户的语音信号分帧, 分别提取分帧后的语音信号的语音声学特征 ; 将多个所述语音声学特征输入基于神经网络训练的深度学习模型中 , 集合成至少一个簇结构; 计算至少一个所述簇结构的平均值和标准 差; 将所述平均值和标准差进行坐标变换以及激活函数计算, 得到特 征向量参数; 将所述特征向量参数以及所述目标用户的身份验证结果 输入预设的基础模型, 得到与所述目标用户对应的声纹模型, 所述声 纹模型用于验证输入的语音信号是否为所述目标用户的。
[权利要求 14] 如权利要求 13所述的计算机设备, 其特征在于, 所述深度学习模型包 括多层模型层, 所述将多个所述语音声学特征输入基于神经网络训练 的深度学习模型中, 集合成至少一个簇结构的步骤包括: 将多个所述 语音声学特征输入深度学习模型中; 选取多个所述语音声学特征中的 任一时间节点 t, 以距离该时间节点 t的每 tn时间内的语音声学特征建 立第 n模型层, n为正整数; 选择所述多层模型层中的目标模型层, 并 获取所述目标模型层上生成的至少一个簇结构。
[权利要求 15] 如权利要求 13所述的计算机设备, 其特征在于, 所述将所述特征向量 参数以及所述目标用户的身份验证结果输入预设的基础模型, 得到与 所述目标用户对应的声纹模型的步骤, 包括: 将所述声纹模型的特征 向量参数进行降维; 将所述降维后的特征向量参数以及所述目标用户 的身份验证结果输入预设的基础模型, 得到声纹模型。
[权利要求 16] 如权利要求 13所述的计算机设备, 其特征在于, 所述提取分帧后的语 音信号的语音声学特征的步骤包括: 将分帧后的语音信号进行快速傅 里叶变换计算, 得到能量谱; 将所述能量谱输入梅尔尺度的三角滤波 器组, 输出共振峰特征; 将所述共振峰特征经离散余弦变换, 得到语 音声学特征。
[权利要求 17] 一种计算机非易失性可读存储介质, 其上存储有计算机可读指令, 其 特征在于, 所述计算机可读指令被处理器执行时实现一种建立声纹模 型的方法, 该建立声纹模型的方法, 包括: 对输入的目标用户的语音 信号分帧, 分别提取分帧后的语音信号的语音声学特征; 将多个所述 语音声学特征输入基于神经网络训练的深度学习模型中, 集合成至少 一个簇结构; 计算至少一个所述簇结构的平均值和标准差; 将所述平 均值和标准差进行坐标变换以及激活函数计算, 得到特征向量参数; 将所述特征向量参数以及所述目标用户的身份验证结果输入预设的基 础模型, 得到与所述目标用户对应的声纹模型, 所述声纹模型用于验 证输入的语音信号是否为所述目标用户的。
[权利要求 18] 如权利要求 17所述的计算机非易失性可读存储介质, 其特征在于, 所 述深度学习模型包括多层模型层, 所述将多个所述语音声学特征输入 基于神经网络训练的深度学习模型中, 集合成至少一个簇结构的步骤 包括: 将多个所述语音声学特征输入深度学习模型中; 选取多个所述 语音声学特征中的任一时间节点 t, 以距离该时间节点 t的每 tn时间内 的语音声学特征建立第 n模型层, n为正整数; 选择所述多层模型层中 的目标模型层, 并获取所述目标模型层上生成的至少一个簇结构。
[权利要求 19] 如权利要求 17所述的计算机非易失性可读存储介质, 其特征在于, 所 述将所述特征向量参数以及所述目标用户的身份验证结果输入预设的 基础模型, 得到与所述目标用户对应的声纹模型的步骤, 包括: 将所 述声纹模型的特征向量参数进行降维; 将所述降维后的特征向量参数 以及所述目标用户的身份验证结果输入预设的基础模型, 得到声纹模 型。
[权利要求 20] 如权利要求 17所述的计算机非易失性可读存储介质, 其特征在于, 所 述提取分帧后的语音信号的语音声学特征的步骤包括: 将分帧后的语 音信号进行快速傅里叶变换计算, 得到能量谱; 将所述能量谱输入梅 尔尺度的三角滤波器组, 输出共振峰特征; 将所述共振峰特征经离散 余弦变换, 得到语音声学特征。
PCT/CN2018/094888 2018-05-08 2018-07-06 建立声纹模型的方法、装置、计算机设备和存储介质 WO2019214047A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2019570559A JP6906067B2 (ja) 2018-05-08 2018-07-06 声紋モデルを構築する方法、装置、コンピュータデバイス、プログラム及び記憶媒体
US16/759,384 US11322155B2 (en) 2018-05-08 2018-07-06 Method and apparatus for establishing voiceprint model, computer device, and storage medium
SG11202002083WA SG11202002083WA (en) 2018-05-08 2018-07-06 Method and apparatus for establishing voiceprint model, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810433792.X 2018-05-08
CN201810433792.XA CN108806696B (zh) 2018-05-08 2018-05-08 建立声纹模型的方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2019214047A1 true WO2019214047A1 (zh) 2019-11-14

Family

ID=64092054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094888 WO2019214047A1 (zh) 2018-05-08 2018-07-06 建立声纹模型的方法、装置、计算机设备和存储介质

Country Status (5)

Country Link
US (1) US11322155B2 (zh)
JP (1) JP6906067B2 (zh)
CN (1) CN108806696B (zh)
SG (1) SG11202002083WA (zh)
WO (1) WO2019214047A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414511A (zh) * 2020-03-25 2020-07-14 合肥讯飞数码科技有限公司 自动声纹建模入库方法、装置以及设备
CN112637428A (zh) * 2020-12-29 2021-04-09 平安科技(深圳)有限公司 无效通话判断方法、装置、计算机设备及存储介质
CN115831152A (zh) * 2022-11-28 2023-03-21 国网山东省电力公司应急管理中心 一种用于实时监测应急装备发电机运行状态的声音监测装置及方法

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110246503A (zh) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 黑名单声纹库构建方法、装置、计算机设备和存储介质
CN110265040B (zh) * 2019-06-20 2022-05-17 Oppo广东移动通信有限公司 声纹模型的训练方法、装置、存储介质及电子设备
CN110211569A (zh) * 2019-07-09 2019-09-06 浙江百应科技有限公司 基于语音图谱和深度学习的实时性别识别方法
CN110491393B (zh) * 2019-08-30 2022-04-22 科大讯飞股份有限公司 声纹表征模型的训练方法及相关装置
CN110428853A (zh) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 语音活性检测方法、语音活性检测装置以及电子设备
CN110600040B (zh) * 2019-09-19 2021-05-25 北京三快在线科技有限公司 声纹特征注册方法、装置、计算机设备及存储介质
CN110780741B (zh) * 2019-10-28 2022-03-01 Oppo广东移动通信有限公司 模型训练方法、应用运行方法、装置、介质及电子设备
CN111292510A (zh) * 2020-01-16 2020-06-16 广州华铭电力科技有限公司 一种城市电缆被外力破坏的识别预警方法
IL274741A (en) * 2020-05-18 2021-12-01 Verint Systems Ltd A system and method for obtaining voiceprints for large populations
TWI807203B (zh) * 2020-07-28 2023-07-01 華碩電腦股份有限公司 聲音辨識方法及使用其之電子裝置
CN112466311B (zh) * 2020-12-22 2022-08-19 深圳壹账通智能科技有限公司 声纹识别方法、装置、存储介质及计算机设备
CN113011302B (zh) * 2021-03-11 2022-04-01 国网电力科学研究院武汉南瑞有限责任公司 一种基于卷积神经网络的雷声信号识别系统及方法
CN113179442B (zh) * 2021-04-20 2022-04-29 浙江工业大学 一种基于语音识别的视频中音频流替换方法
CN113077536A (zh) * 2021-04-20 2021-07-06 深圳追一科技有限公司 一种基于bert模型的嘴部动作驱动模型训练方法及组件
CN113421575B (zh) * 2021-06-30 2024-02-06 平安科技(深圳)有限公司 声纹识别方法、装置、设备及存储介质
CN114495948B (zh) * 2022-04-18 2022-09-09 北京快联科技有限公司 一种声纹识别方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845140A (zh) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 应用于短语音条件下的说话人确认方法和装置
CN106448684A (zh) * 2016-11-16 2017-02-22 北京大学深圳研究生院 基于深度置信网络特征矢量的信道鲁棒声纹识别系统
CN106847292A (zh) * 2017-02-16 2017-06-13 平安科技(深圳)有限公司 声纹识别方法及装置
WO2017113680A1 (zh) * 2015-12-30 2017-07-06 百度在线网络技术(北京)有限公司 声纹认证处理方法及装置
CN107680582A (zh) * 2017-07-28 2018-02-09 平安科技(深圳)有限公司 声学模型训练方法、语音识别方法、装置、设备及介质

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645137B2 (en) * 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
KR100679051B1 (ko) * 2005-12-14 2007-02-05 삼성전자주식회사 복수의 신뢰도 측정 알고리즘을 이용한 음성 인식 장치 및방법
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
CN104485102A (zh) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 声纹识别方法和装置
CN106157959B (zh) * 2015-03-31 2019-10-18 讯飞智元信息科技有限公司 声纹模型更新方法及系统
US10884503B2 (en) * 2015-12-07 2021-01-05 Sri International VPA with integrated object recognition and facial expression recognition
CN107492382B (zh) * 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 基于神经网络的声纹信息提取方法及装置
WO2018184192A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Methods and systems using camera devices for deep channel and convolutional neural network images and formats
CN110352432A (zh) * 2017-04-07 2019-10-18 英特尔公司 使用用于深度神经网络的改进的训练和学习的方法和系统
US10896669B2 (en) * 2017-05-19 2021-01-19 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
US20180358003A1 (en) * 2017-06-09 2018-12-13 Qualcomm Incorporated Methods and apparatus for improving speech communication and speech interface quality using neural networks
CN107357875B (zh) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 一种语音搜索方法、装置及电子设备
US11055604B2 (en) * 2017-09-12 2021-07-06 Intel Corporation Per kernel Kmeans compression for neural networks
CN107993071A (zh) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 电子装置、基于声纹的身份验证方法及存储介质
US11264037B2 (en) * 2018-01-23 2022-03-01 Cirrus Logic, Inc. Speaker identification
US10437936B2 (en) * 2018-02-01 2019-10-08 Jungle Disk, L.L.C. Generative text using a personality model
WO2020035085A2 (en) * 2019-10-31 2020-02-20 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017113680A1 (zh) * 2015-12-30 2017-07-06 百度在线网络技术(北京)有限公司 声纹认证处理方法及装置
CN105845140A (zh) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 应用于短语音条件下的说话人确认方法和装置
CN106448684A (zh) * 2016-11-16 2017-02-22 北京大学深圳研究生院 基于深度置信网络特征矢量的信道鲁棒声纹识别系统
CN106847292A (zh) * 2017-02-16 2017-06-13 平安科技(深圳)有限公司 声纹识别方法及装置
CN107680582A (zh) * 2017-07-28 2018-02-09 平安科技(深圳)有限公司 声学模型训练方法、语音识别方法、装置、设备及介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414511A (zh) * 2020-03-25 2020-07-14 合肥讯飞数码科技有限公司 自动声纹建模入库方法、装置以及设备
CN111414511B (zh) * 2020-03-25 2023-08-22 合肥讯飞数码科技有限公司 自动声纹建模入库方法、装置以及设备
CN112637428A (zh) * 2020-12-29 2021-04-09 平安科技(深圳)有限公司 无效通话判断方法、装置、计算机设备及存储介质
CN115831152A (zh) * 2022-11-28 2023-03-21 国网山东省电力公司应急管理中心 一种用于实时监测应急装备发电机运行状态的声音监测装置及方法
CN115831152B (zh) * 2022-11-28 2023-07-04 国网山东省电力公司应急管理中心 一种用于实时监测应急装备发电机运行状态的声音监测装置及方法

Also Published As

Publication number Publication date
JP2020524308A (ja) 2020-08-13
JP6906067B2 (ja) 2021-07-21
CN108806696A (zh) 2018-11-13
US20200294509A1 (en) 2020-09-17
SG11202002083WA (en) 2020-04-29
CN108806696B (zh) 2020-06-05
US11322155B2 (en) 2022-05-03

Similar Documents

Publication Publication Date Title
WO2019214047A1 (zh) 建立声纹模型的方法、装置、计算机设备和存储介质
Kinnunen Spectral features for automatic text-independent speaker recognition
CN103928023B (zh) 一种语音评分方法及系统
Lung et al. Fuzzy phoneme classification using multi-speaker vocal tract length normalization
Bhangale et al. A review on speech processing using machine learning paradigm
Patel et al. Speech recognition and verification using MFCC & VQ
CN112767958A (zh) 一种基于零次学习的跨语种音色转换系统及方法
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Pawar et al. Review of various stages in speaker recognition system, performance measures and recognition toolkits
CN110648684A (zh) 一种基于WaveNet的骨导语音增强波形生成方法
Nanavare et al. Recognition of human emotions from speech processing
Kanabur et al. An extensive review of feature extraction techniques, challenges and trends in automatic speech recognition
Grewal et al. Isolated word recognition system for English language
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
CN116229932A (zh) 一种基于跨域一致性损失的语音克隆方法及系统
CN110838294B (zh) 一种语音验证方法、装置、计算机设备及存储介质
Bawa et al. Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions
Kurian et al. Connected digit speech recognition system for Malayalam language
Musaev et al. Advanced feature extraction method for speaker identification using a classification algorithm
Dalva Automatic speech recognition system for Turkish spoken language
Mital Speech enhancement for automatic analysis of child-centered audio recordings
Deo et al. Review of Feature Extraction Techniques
KR20080039072A (ko) 홈 네트워크 제어를 위한 음성인식시스템
Gupta et al. Speech Recognition using MFCC & VQ

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019570559

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18917995

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18917995

Country of ref document: EP

Kind code of ref document: A1