WO2019154107A1 - Method and apparatus for voiceprint recognition based on memory bottleneck features - Google Patents

Method and apparatus for voiceprint recognition based on memory bottleneck features

Info

Publication number
WO2019154107A1
WO2019154107A1 (PCT/CN2019/073101)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
layer
bottleneck
memory
model
Prior art date
Application number
PCT/CN2019/073101
Other languages
English (en)
French (fr)
Inventor
王志铭
周俊
李小龙
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority to EP19750725.4A (EP3719798B1)
Priority to SG11202006090RA
Priority to EP21198758.1A (EP3955246B1)
Publication of WO2019154107A1
Priority to US16/905,354 (US20200321008A1)

Classifications

    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/18 - Artificial neural networks; connectionist approaches
    • G10L 25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G06F 21/32 - User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/063 - Physical realisation, i.e. hardware implementation, of neural networks using electronic means
    • G06N 3/08 - Learning methods
    • G06N 5/046 - Forward inferencing; production systems
    • G06N 7/01 - Probabilistic graphical models, e.g. probabilistic networks
    • G06N 20/10 - Machine learning using kernel methods, e.g. support vector machines [SVM]

Definitions

  • A voiceprint is an acoustic feature extracted from the spectral characteristics of a speaker's sound waves. Like a fingerprint, a voiceprint is a biometric feature that reflects the speaker's traits and identity. Voiceprint recognition, also known as speaker recognition, is a biometric authentication technique that uses the speaker-specific information contained in a voice signal to automatically identify the speaker. This biometric authentication technology has broad application prospects in many fields and scenarios, such as identity authentication and security identity verification.
  • The identity vector (i-vector) model is commonly used in voiceprint recognition systems.
  • The i-vector model assumes that the speaker and channel information in speech is contained in a low-dimensional linear subspace, and that each utterance can be characterized by a fixed-length vector in that low-dimensional space; this vector is the identity vector (i-vector).
  • The i-vector is highly discriminative and contains the speaker's identity information, making it an important feature for voiceprint recognition and speech recognition.
  • i-vector-based voiceprint recognition generally includes the following steps: computing acoustic statistics from spectral features, extracting the i-vector from those statistics, and then performing speaker recognition based on the i-vector. The extraction of the i-vector is therefore very important. However, i-vector extraction in existing voiceprint recognition pipelines is not comprehensive enough, so a more effective solution is needed to obtain more comprehensive voiceprint features and further improve the accuracy of voiceprint recognition.
  • One or more embodiments of the present specification describe a method and apparatus that can acquire more comprehensive acoustic features from speaker audio, so that identity vector extraction is more comprehensive and the accuracy of voiceprint recognition is improved.
  • A method for voiceprint recognition comprises: extracting a first spectral feature from speaker audio; inputting the speaker audio into a memory deep neural network (DNN) and extracting a bottleneck feature from the bottleneck layer of the memory DNN, where the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other hidden layers in the memory DNN; forming an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature; extracting an identity vector corresponding to the speaker audio based on the acoustic feature; and performing speaker recognition with a classification model based on the identity vector.
  • In one embodiment, the first spectral features include Mel-frequency cepstral coefficient (MFCC) features, together with the first-order and second-order differential features of the MFCC features.
  • In one design, the at least one temporal recursive layer comprises a hidden layer based on a long short-term memory (LSTM) model, or a hidden layer based on an LSTMP model, where the LSTMP model is an LSTM model with a recurrent projection layer.
  • In another design, the at least one temporal recursive layer comprises a hidden layer based on a feedforward sequential memory network (FSMN) model, or a hidden layer based on a cFSMN model, where the cFSMN model is a compact FSMN model.
  • In one implementation, inputting the speaker audio into the memory deep neural network DNN comprises: extracting a second spectral feature from consecutive multi-frame speech of the speaker audio and inputting the second spectral feature into the memory DNN.
  • In one example, the second spectral feature is a Mel-scale filter bank (FBank) feature.
  • In one embodiment, forming the acoustic feature of the speaker audio includes splicing the first spectral feature and the bottleneck feature to form the acoustic feature.
  • An apparatus for voiceprint recognition comprises:
  • a first extraction unit configured to extract a first spectral feature from the speaker audio;
  • a second extraction unit configured to input the speaker audio into a memory deep neural network (DNN) and extract a bottleneck feature from the bottleneck layer of the memory DNN, where the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other hidden layers in the memory DNN;
  • a feature combination unit configured to form an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature;
  • a vector extraction unit configured to extract an identity vector corresponding to the speaker audio based on the acoustic feature; and
  • a classification recognition unit configured to perform speaker recognition with a classification model based on the identity vector.
  • A computer-readable storage medium has a computer program stored thereon which, when executed in a computer, causes the computer to perform the method of the first aspect.
  • A computing device comprises a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method of the first aspect is implemented.
  • A deep neural network DNN with a memory function is designed, and bottleneck features with a memory effect are extracted from the bottleneck layer of such a network and included in the acoustic features.
  • Such acoustic features better reflect the speaker's timing-related prosodic characteristics.
  • The identity vector (i-vector) extracted from such acoustic features can better characterize the speaker's vocal traits, especially prosodic features, so that the accuracy of speaker recognition is improved.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in the present specification
  • FIG. 2 shows a flow chart of a voiceprint recognition method in accordance with one embodiment
  • FIG. 3 shows a schematic diagram of a bottleneck layer in a deep neural network
  • FIG. 4 shows a schematic structural diagram of a memory DNN according to one embodiment
  • FIG. 5 shows a schematic structural diagram of a memory DNN according to another embodiment
  • FIG. 6 shows a comparison of LSTM and LSTMP
  • FIG. 7 shows a schematic structural diagram of a memory DNN according to another embodiment
  • FIG. 8 shows a schematic structural diagram of a memory DNN according to one embodiment
  • FIG. 9 shows a schematic structural diagram of a memory DNN according to another embodiment
  • Figure 10 shows a schematic block diagram of a voiceprint recognition device in accordance with one embodiment.
  • FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in the present specification.
  • the speaker speaks to form the speaker audio.
  • The speaker audio is input to a spectrum extraction unit, which extracts the basic spectral features from it.
  • The speaker audio is also input to a deep neural network (DNN).
  • This deep neural network DNN is a neural network with a memory function and has a bottleneck layer; correspondingly, the features of the bottleneck layer also carry a memory effect.
  • Bottleneck features are extracted from the bottleneck layer of the memory-enabled DNN and combined with the basic spectral features to form the acoustic features.
  • The acoustic features are then input into an identity vector (i-vector) model, in which acoustic statistics are computed from the acoustic features, the i-vector is extracted based on those statistics, and speaker recognition is performed. In this way, the result of the voiceprint recognition is output.
  • The voiceprint recognition method of this embodiment includes the following steps. Step 21: extract spectral features from the speaker audio. Step 22: input the speaker audio into a memory deep neural network DNN and extract bottleneck features from the bottleneck layer of the memory DNN, where the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other layers in the memory DNN. Step 23: form the acoustic features of the speaker audio based on the basic spectral features and the bottleneck features. Step 24: extract the identity vector corresponding to the speaker audio based on the acoustic features. Step 25: perform speaker recognition based on the identity vector.
  • First, spectral features are extracted from the speaker audio. It can be understood that the speaker audio is audio formed by the speaker talking, and it can be divided into a plurality of speech segments.
  • The spectral features extracted in step 21 are basic spectral features, in particular (single-frame) short-time spectral features.
  • In one embodiment, the spectral feature is the Mel-frequency cepstral coefficient (MFCC) feature.
  • The Mel frequency scale is based on the auditory characteristics of the human ear and is nonlinearly related to frequency in hertz (Hz).
  • Extracting MFCC features from speaker audio generally includes the following steps: pre-emphasis, framing, windowing, Fourier transform, Mel filter bank, and discrete cosine transform (DCT).
  • Pre-emphasis boosts the high-frequency part to a certain extent so that the spectrum of the signal becomes flatter;
  • framing divides the speech into a series of frames by time;
  • windowing applies a window function to improve the continuity at the left and right ends of each frame.
  • The audio is then Fourier transformed to convert the time-domain signal into a frequency-domain signal. The Mel filter bank maps the frequencies of the frequency-domain signal onto the Mel scale, yielding the Mel spectrum. The cepstral coefficients of the Mel spectrum are then obtained by the discrete cosine transform, giving the MFCC features. Further, dynamic difference parameters can be extracted from the standard MFCC cepstral parameters to obtain differential features that reflect the dynamic variation between frames. Therefore, in general, first-order and second-order differential features are also obtained on the basis of the extracted MFCC features. For example, if the Mel-cepstral feature is 20-dimensional, then a 20-dimensional first-order difference feature and a 20-dimensional second-order difference feature are also obtained in the differential-parameter extraction phase, forming a 60-dimensional vector.
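As a concrete illustration, below is a minimal sketch of this MFCC-plus-differences pipeline using librosa; the 16 kHz sampling rate, the 0.97 pre-emphasis coefficient and the 20-dimensional setting are illustrative assumptions rather than values fixed by the patent.

```python
import librosa
import numpy as np

def mfcc_with_deltas(wav_path, sr=16000, n_mfcc=20):
    y, _ = librosa.load(wav_path, sr=sr)
    # Pre-emphasis to boost the high-frequency part of the spectrum.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # librosa performs framing, windowing, FFT, Mel filtering and DCT internally.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (20, n_frames)
    d1 = librosa.feature.delta(mfcc, order=1)                # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)                # second-order differences
    return np.vstack([mfcc, d1, d2])                         # (60, n_frames): one 60-dim vector per frame
```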
  • In another embodiment, the basic spectral features extracted in step 21 include a linear predictive coding (LPC) feature or a perceptual linear predictive (PLP) feature. These features can be extracted by conventional methods, and other short-time spectral features may also be extracted as basic features.
  • A conventional deep neural network DNN is an extension of the traditional feedforward artificial neural network, with more hidden layers and stronger expressive power; in recent years it has been applied in the field of speech recognition.
  • In speech recognition, the deep neural network DNN replaces the GMM part of the Gaussian mixture model-hidden Markov model (GMM-HMM) acoustic model to characterize the emission probabilities of the different states of the HMM.
  • In general, a DNN for speech recognition takes as input acoustic features spliced from multiple preceding and following frames, and its output layer usually uses a softmax function to characterize the posterior probabilities of the HMM phoneme states, thereby classifying the phoneme states.
  • The deep neural network DNN has this classification ability because, through supervised data, it learns a feature representation that benefits the specific classification task.
  • In a DNN that contains a bottleneck layer, the bottleneck-layer feature is a good embodiment of such a feature representation.
  • Specifically, the bottleneck layer is a hidden layer in the DNN model whose number of nodes, or dimension, is significantly reduced compared with the other hidden layers.
  • In other words, the bottleneck layer contains fewer nodes than any other layer in the deep neural network DNN.
  • For example, in a DNN where the other hidden layers have 1024 nodes each while one layer has only 64 nodes, the hidden-layer topology is 1024-1024-64-1024-1024, and the middle 64-node hidden layer is called the bottleneck layer.
  • Figure 3 shows a schematic diagram of a bottleneck layer in a deep neural network.
  • As shown there, the deep neural network includes a plurality of hidden layers, and the hidden layer whose number of nodes is significantly reduced compared with the others is the bottleneck layer.
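For illustration, the following is a sketch of such a 1024-1024-64-1024-1024 bottleneck topology in PyTorch; the input dimension, output dimension and ReLU activations are assumptions made only so the example is runnable, not details given in the patent.

```python
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_in=440, n_out=3000):   # assumed sizes: stacked input frames, HMM states
        super().__init__()
        self.pre = nn.Sequential(
            nn.Linear(n_in, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(1024, 64)    # the 64-node bottleneck layer
        self.post = nn.Sequential(
            nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_out),              # logits; softmax is applied in the loss
        )

    def forward(self, x):
        z = self.bottleneck(self.pre(x))         # bottleneck activations = the bottleneck feature
        return self.post(z), z
```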
  • The improvement is made on the basis of the conventional deep neural network DNN: a memory function is introduced to form a deep neural network with memory, i.e., a memory DNN.
  • Specifically, the memory DNN is designed to include at least one temporal recursive layer and a bottleneck layer, with the output of the temporal recursive layer connected to the bottleneck layer, so that the features of the bottleneck layer can reflect temporal characteristics and thereby have a "memory" function.
  • The speaker audio is input into the memory deep neural network DNN, and the memory bottleneck features are extracted from its bottleneck layer.
  • In one embodiment, the temporal recursive layer employs a hidden layer of a recurrent neural network (RNN). More specifically, in one embodiment, the temporal recursive layer uses a long short-term memory (LSTM) model.
  • The recurrent neural network (RNN) is a temporal recursive neural network that can be used to process sequence data.
  • In an RNN, the current output of a sequence is associated with the outputs that precede it.
  • Specifically, the RNN memorizes earlier information and applies it to the computation of the current output: the nodes between hidden layers are connected, and the input of the hidden layer includes not only the output of the input layer but also the hidden layer's own output at the previous time step.
  • That is, the hidden-layer state at step $t$ can be expressed as $S_t = f(U \cdot X_t + W \cdot S_{t-1})$, where $X_t$ is the input at step $t$, $S_{t-1}$ is the hidden-layer state at step $t-1$, $f$ is the activation function, and $W$, $U$ are weights. In this way, the RNN feeds earlier states back into the current input and thus accounts for the temporal effects of the input sequence.
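A toy numpy illustration of this recurrence, with arbitrarily chosen dimensions:

```python
import numpy as np

def rnn_states(X, U, W, f=np.tanh):
    """X: (T, d_in) inputs; U: (d_h, d_in); W: (d_h, d_h). Returns all hidden states (T, d_h)."""
    S = np.zeros((len(X), W.shape[0]))
    s = np.zeros(W.shape[0])
    for t, x in enumerate(X):
        s = f(U @ x + W @ s)   # current state depends on the current input and the previous state
        S[t] = s
    return S
```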
  • When long-term memory must be handled, RNNs suffer from the long-term dependency problem and are difficult to train, for example because gradients easily explode or vanish.
  • The LSTM model, proposed on the basis of the RNN, further addresses this long-term dependency problem.
  • In the LSTM model, three gate computations are carried out in the repeating network module: the input gate, the output gate, and the forget gate.
  • The forget gate lets information pass selectively, discarding information that is no longer needed; unnecessary interference in the input can thus be judged and shielded, allowing long-term data to be analyzed and processed better.
  • As shown in FIG. 4, the deep neural network DNN includes an input layer, an output layer, and several hidden layers. These hidden layers include a temporal recursive layer formed by an LSTM model. The output of the LSTM layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers.
  • The bottleneck layer has a significantly reduced dimension, for example 64 or 128.
  • The other hidden layers have dimensions of, for example, 1024 or 1500, all significantly higher than the bottleneck layer.
  • The dimension of the LSTM layer may be the same as or different from that of the other conventional hidden layers, but is likewise significantly higher than the dimension of the bottleneck layer.
  • In a typical example, the conventional hidden layers have dimension 1024, the LSTM layer has dimension 800, and the bottleneck layer has dimension 64, forming a deep neural network DNN with a dimensional topology of 1024-1024-800-64-1024-1024.
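A hedged PyTorch sketch of this FIG. 4 style structure follows; the layer sizes mirror the typical example above, while the 40-dimensional frame input and the number of output states are assumptions.

```python
import torch
import torch.nn as nn

class MemoryBottleneckDNN(nn.Module):
    def __init__(self, n_in=40, n_states=3000):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_in, 1024), nn.ReLU(),
                                    nn.Linear(1024, 1024), nn.ReLU())
        self.lstm = nn.LSTM(1024, 800, batch_first=True)   # temporal recursive layer
        self.bottleneck = nn.Linear(800, 64)               # memory bottleneck layer
        self.post = nn.Sequential(nn.ReLU(), nn.Linear(64, 1024), nn.ReLU(),
                                  nn.Linear(1024, 1024), nn.ReLU(),
                                  nn.Linear(1024, n_states))

    def forward(self, x):                                  # x: (batch, frames, n_in)
        h, _ = self.lstm(self.hidden(x))
        z = self.bottleneck(h)                             # per-frame bottleneck features
        return self.post(z), z
```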
  • FIG. 5 shows a schematic structural diagram of a memory DNN according to another embodiment.
  • As shown in FIG. 5, the deep neural network DNN includes an input layer, an output layer and several hidden layers, and the hidden layers include an LSTMP (LSTM projected) layer as the temporal recursive layer.
  • The LSTMP layer is an LSTM architecture with a recurrent projection layer.
  • Figure 6 shows a comparison of LSTM and LSTMP.
  • In the conventional LSTM architecture, the recurrent connection of the LSTM layer is implemented by the LSTM itself, that is, directly from the output units back to the input units.
  • Under the LSTMP architecture, a separate linear projection layer is added after the LSTM layer.
  • The recurrent connection then runs from this recurrent projection layer back to the input of the LSTM layer.
  • By setting the number of units in the recurrent projection layer, the number of nodes of the LSTM layer can be projected down to a lower dimension.
  • The output of the recurrent projection layer in the LSTMP layer is connected to the bottleneck layer, which is followed by conventional hidden layers.
  • Similarly, the bottleneck layer has a significantly reduced dimension, while the other hidden layers, including the LSTMP layer, have significantly higher dimensions.
  • In a typical example, the conventional hidden layers have dimension 1024, the LSTM layer has dimension 800, and the recurrent projection layer projects down to 512, forming a deep neural network DNN with a dimensional topology of 1024-800-512-64-1024-1024.
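Incidentally, PyTorch's nn.LSTM exposes a proj_size argument that implements this kind of recurrent projection: the raw LSTM output is projected down, and the recurrent connection runs from the projection back into the cell. A two-line sketch with the dimensions from the example above:

```python
import torch.nn as nn

lstmp = nn.LSTM(input_size=1024, hidden_size=800, proj_size=512, batch_first=True)
bottleneck = nn.Linear(512, 64)   # the projection output feeds the 64-dim bottleneck layer
```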
  • Although in the examples of FIGS. 4 and 5 the input layer is directly connected to the LSTM/LSTMP temporal recursive layer, it will be appreciated that other conventional hidden layers may also be included before the temporal recursive layer.
  • In one embodiment, the LSTM layers in FIG. 7 (which shows a memory DNN with two LSTM layers, the latter connected to the bottleneck layer) can be replaced with LSTMP layers. In another embodiment, more LSTM layers can be included in the DNN.
  • In one embodiment, a temporal recursive layer is formed in the deep neural network DNN using the feedforward sequential memory network (FSMN) model.
  • The FSMN model can be regarded as a standard feedforward fully connected neural network with some learnable memory modules placed in its hidden layers. These memory modules use a tapped-delay-line structure to encode long-term context information into a fixed-size representation, serving as a short-term memory mechanism. The FSMN therefore models the long-term dependency in the time signal without requiring feedback connections. For speech recognition, FSMN performs well, and its training process is simpler and more efficient.
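A minimal sketch of an FSMN-style memory block as described, with learnable tapped-delay coefficients over past activations and no feedback connection; the order of 10 taps is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNMemory(nn.Module):
    def __init__(self, dim=1024, order=10):
        super().__init__()
        self.coef = nn.Parameter(torch.zeros(order + 1, dim))  # a_0 ... a_N, one weight per dimension

    def forward(self, h):                        # h: (batch, T, dim) hidden activations
        mem = torch.zeros_like(h)
        for i in range(self.coef.shape[0]):
            # h delayed by i steps (zero-padded at the sequence start)
            shifted = h if i == 0 else F.pad(h, (0, 0, i, 0))[:, :-i]
            mem = mem + self.coef[i] * shifted
        return h + mem                           # fixed-size encoding of long-term context
```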
  • On the basis of FSMN, a compact version, cFSMN (compact FSMN), has also been proposed, with a more simplified model structure.
  • In the cFSMN, the input data first passes through a projection layer for dimensionality reduction (for example, down to 512 dimensions), then undergoes memory processing by the memory module, and finally the memory-processed feature data (for example, 1024-dimensional) is output.
  • The DNN can be given a memory function by introducing an FSMN model, or a cFSMN model, into the deep neural network DNN.
  • Figure 8 shows a structural diagram of a memory DNN in accordance with one embodiment.
  • As shown in FIG. 8, the deep neural network DNN includes an input layer, an output layer, and several hidden layers. These hidden layers include a temporal recursive layer formed by an FSMN model. The output of the FSMN layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers. The bottleneck layer has a significantly reduced dimension; the other hidden layers, including the FSMN layer, have significantly higher dimensions than the bottleneck layer, such as 1024 or 1500. In a typical example, the FSMN layer dimension is 1024, the other hidden-layer dimensions are 2048, and the bottleneck-layer dimension is 64, forming a deep neural network DNN with a dimensional topology of 2048-1024-64-2048-2048.
  • FIG. 9 shows a schematic structural diagram of a memory DNN according to another embodiment.
  • As shown in FIG. 9, the deep neural network DNN includes an input layer, an output layer, and several hidden layers. These hidden layers include two cFSMN layers; the output of the latter cFSMN layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers. Similarly, the bottleneck layer has a significantly reduced dimension, while the other hidden layers, including the two cFSMN layers, have significantly higher dimensions.
  • In a typical example, the conventional hidden layers have dimension 2048, the two cFSMN layers have dimension 1024, and the bottleneck layer has dimension 64, forming a deep neural network DNN with a dimensional topology of 2048-1024-1024-64-2048-2048.
  • It is understood that the cFSMN layers of FIG. 9 can also be replaced with FSMN layers.
  • In another embodiment, more FSMN/cFSMN layers may be included in the DNN.
  • In one embodiment, other temporal recursive models may also be employed in the deep neural network DNN to form a DNN with a memory function.
  • In general, the memory-enabled DNN includes one or more temporal recursive layers, and the temporal recursive layer is directly connected to the bottleneck layer, so that the features of the bottleneck layer can reflect temporal effects and carry a memory effect.
  • The more temporal recursive layers there are, the better the performance but the higher the network complexity; the fewer there are, the simpler the training of the network model.
  • Typically, a DNN with between 1 and 5 temporal recursive layers is employed.
  • As can be seen from the above, the deep neural network DNN can be designed to have a temporal recursive layer before the bottleneck layer.
  • Such a memory-enabled DNN can be trained for speech recognition by conventional methods.
  • The bottleneck features of the trained DNN can reflect richer speech information than the basic spectral features.
  • Because the DNN has a temporal recursive layer before the bottleneck layer, the bottleneck features also carry a memory function, reflecting the temporal effects of the speech.
  • Accordingly, the bottleneck-layer features of the memory-enabled deep neural network DNN described above are extracted, thereby obtaining bottleneck features with a memory function.
  • Specifically, the speaker audio is input into the memory deep neural network DNN described above.
  • In one embodiment, consecutive multi-frame speech from the speaker audio is input into the DNN.
  • The consecutive multi-frame input includes, for example, 10 frames looking back and 5 frames looking ahead, plus the current frame, for a total of 16 consecutive frames.
  • For these consecutive frames, it is common to input their basic spectral features into the memory DNN.
  • In one embodiment, the basic spectral feature input to the DNN is the Mel-frequency cepstral coefficient (MFCC) feature.
  • In another embodiment, the basic spectral feature input to the DNN is the Mel-scale filter bank (FBank) feature.
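A small numpy sketch of this context window (10 past frames, the current frame and 5 future frames, stacked per position); edge padding at the utterance boundaries is an assumption about boundary handling.

```python
import numpy as np

def stack_context(feats, left=10, right=5):
    """feats: (T, d) FBank frames -> (T, (left + 1 + right) * d) stacked DNN input."""
    T, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + 1 + right)])
```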
  • The basic spectral features input to the memory DNN are processed by the DNN to form a series of bottleneck features at the bottleneck layer.
  • Specifically, the bottleneck layer contains a small number of low-dimensional nodes that are assigned activation values during the forward computation in which the DNN processes the spectral features.
  • The bottleneck features are extracted by reading the activation values of the bottleneck-layer nodes.
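One possible way to read these activation values in PyTorch is a forward hook on the bottleneck layer; the MemoryBottleneckDNN class here is the hypothetical sketch given earlier, not an API from the patent.

```python
import torch

feats = {}
def grab(module, inputs, output):
    feats["bottleneck"] = output.detach()      # activation values of the bottleneck nodes

model = MemoryBottleneckDNN()                  # hypothetical model sketched above
handle = model.bottleneck.register_forward_hook(grab)
with torch.no_grad():
    model(torch.randn(1, 16, 40))              # 16 consecutive frames of 40-dim features
bottleneck_feature = feats["bottleneck"]       # shape (1, 16, 64)
handle.remove()
```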
  • Thus, in step 21, the basic spectral features are extracted from the speaker audio,
  • and in step 22, the bottleneck features are extracted from the memory-enabled deep neural network DNN.
  • On this basis, in step 23, the spectral features and the bottleneck features described above are combined to form the acoustic features of the speaker audio.
  • In one embodiment, the bottleneck features are spliced with the basic spectral features to form the acoustic features of the speaker audio.
  • The i-vector model is built on the Gaussian mean supervector space represented by the Gaussian mixture model-universal background model (GMM-UBM); it assumes that speaker information and channel information are both contained in a low-dimensional subspace. Given an utterance, its Gaussian mean supervector $M$ can be decomposed as $M = m + T\omega$,
  • where $m$ is the speaker- and channel-independent component, which can usually be replaced by the mean supervector of the UBM; $T$ is the total variability subspace matrix; and $\omega$ is the variability factor containing the speaker and channel information, i.e., the i-vector.
  • Speaker recognition is performed based on the identity vector (i-vector) extracted above.
  • The extracted i-vector can be input as an identity feature into a classifier model for classification and speaker recognition.
  • The classifier model is, for example, a probabilistic linear discriminant analysis (PLDA) model, which computes a likelihood-ratio score between different i-vectors and makes a decision based on that score.
  • In another example, the classifier model is a support vector machine (SVM) model.
  • The SVM is a supervised classification algorithm that classifies i-vectors by finding a separating hyperplane that divides the data on its two sides.
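A sketch of the SVM route with scikit-learn on synthetic i-vectors; scikit-learn has no built-in PLDA, so a simple cosine score is shown only as a stand-in for the likelihood-ratio scoring, and the 124-dimensional size follows the splicing example in this document.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 124))            # enrolled i-vectors (synthetic data)
y = rng.integers(0, 2, size=200)           # speaker labels (synthetic data)
clf = SVC(kernel="linear").fit(X, y)       # separating hyperplane over i-vectors
print(clf.predict(rng.normal(size=(1, 124))))

def cosine_score(iv1, iv2):                # simple stand-in for PLDA likelihood-ratio scoring
    return iv1 @ iv2 / (np.linalg.norm(iv1) * np.linalg.norm(iv2))
```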
  • In another embodiment, the temporal recursive layer may further comprise a hidden layer based on the feedforward sequential memory network (FSMN) model, or a hidden layer based on the cFSMN model, where the cFSMN model is a compact FSMN model.
  • The second extraction unit 120 is configured to extract a second spectral feature from consecutive multi-frame speech of the speaker audio and input the second spectral feature into the deep neural network DNN.
  • Further, in one example, the second spectral feature is the Mel-scale filter bank (FBank) feature.
  • The feature combination unit 130 is configured to splice the first spectral feature and the bottleneck feature to form the acoustic feature.
  • With the method and apparatus described above, a deep neural network DNN with a memory function is designed, and bottleneck features with a memory effect are extracted from the bottleneck layer of such a network and included in the acoustic features.
  • Such acoustic features better reflect the speaker's timing-related prosodic characteristics.
  • The identity vector (i-vector) extracted from such acoustic features can better characterize the speaker's vocal traits, especially prosodic features, so that the accuracy of speaker recognition is improved.
  • A computer-readable storage medium has a computer program stored thereon which, when executed in a computer, causes the computer to perform the method described in connection with FIG. 2.
  • A computing device comprises a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method described in connection with FIG. 2 is implemented.
  • The functions described herein can be implemented in hardware, software, firmware, or any combination thereof.
  • When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Neurology (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and apparatus for voiceprint recognition. The method comprises: extracting basic spectral features from speaker audio (21); inputting the speaker audio into a deep neural network DNN and extracting bottleneck features from the bottleneck layer of the DNN, where the DNN includes a temporal recursive layer connected to the bottleneck layer (22); forming acoustic features of the speaker audio based on the basic spectral features and the bottleneck features (23); extracting the identity vector (i-vector) corresponding to the speaker audio based on the acoustic features (24); and performing speaker recognition based on the i-vector (25).

Description

Method and apparatus for voiceprint recognition based on memory bottleneck features
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 201810146310.2, filed on February 12, 2018 and entitled "Method and apparatus for voiceprint recognition based on memory bottleneck features", which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
One or more embodiments of this specification relate to the field of computer technology, and in particular to voiceprint recognition.
BACKGROUND
A voiceprint is an acoustic feature extracted from the spectral characteristics of a speaker's sound waves. Like a fingerprint, a voiceprint is a biometric feature that reflects the speaker's traits and identity. Voiceprint recognition, also known as speaker recognition, is a biometric authentication technique that automatically identifies the speaker using the speaker-specific information contained in the voice signal. This biometric authentication technology has broad application prospects in many fields and scenarios, such as identity authentication and security identity verification.
The identity vector (i-vector) model is commonly used in voiceprint recognition systems. The i-vector model assumes that the speaker and channel information in speech is contained in a low-dimensional linear subspace, and that each utterance can be characterized by a fixed-length vector in that low-dimensional space; this vector is the identity vector (i-vector). The i-vector is highly discriminative, contains the speaker's identity information, and is an important feature for voiceprint recognition and speech recognition. i-vector-based voiceprint recognition generally includes the following steps: computing acoustic statistics from spectral features, extracting the i-vector from those statistics, and performing speaker recognition based on the i-vector. The extraction of the i-vector is therefore very important. However, i-vector extraction in existing voiceprint recognition pipelines is not comprehensive enough. A more effective solution is thus needed to obtain more comprehensive voiceprint features and further improve the accuracy of voiceprint recognition.
SUMMARY
One or more embodiments of this specification describe a method and apparatus that can acquire more comprehensive acoustic features from speaker audio, so that identity vector extraction is more comprehensive and the accuracy of voiceprint recognition is improved.
According to a first aspect, a voiceprint recognition method is provided, comprising: extracting a first spectral feature from speaker audio; inputting the speaker audio into a memory deep neural network (DNN) and extracting a bottleneck feature from the bottleneck layer of the memory DNN, wherein the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other hidden layers in the memory DNN; forming an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature; extracting an identity vector corresponding to the speaker audio based on the acoustic feature; and performing speaker recognition with a classification model based on the identity vector.
In one embodiment, the first spectral features include Mel-frequency cepstral coefficient (MFCC) features, together with the first-order and second-order differential features of the MFCC features.
In one possible design, the at least one temporal recursive layer includes a hidden layer based on a long short-term memory (LSTM) model, or a hidden layer based on an LSTMP model, where the LSTMP model is an LSTM model with a recurrent projection layer.
In another possible design, the at least one temporal recursive layer includes a hidden layer based on a feedforward sequential memory network (FSMN) model, or a hidden layer based on a cFSMN model, where the cFSMN model is a compact FSMN model.
According to one implementation, inputting the speaker audio into the memory deep neural network DNN includes: extracting a second spectral feature from consecutive multi-frame speech of the speaker audio, and inputting the second spectral feature into the memory DNN.
Further, in one example, the second spectral feature is a Mel-scale filter bank (FBank) feature.
According to one embodiment, forming the acoustic feature of the speaker audio includes splicing the first spectral feature and the bottleneck feature to form the acoustic feature.
According to a second aspect, an apparatus for voiceprint recognition is provided, comprising:
a first extraction unit configured to extract a first spectral feature from speaker audio;
a second extraction unit configured to input the speaker audio into a memory deep neural network (DNN) and extract a bottleneck feature from the bottleneck layer of the memory DNN, wherein the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other hidden layers in the memory DNN;
a feature combination unit configured to form an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature;
a vector extraction unit configured to extract an identity vector corresponding to the speaker audio based on the acoustic feature; and
a classification recognition unit configured to perform speaker recognition with a classification model based on the identity vector.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, comprising a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method of the first aspect is implemented.
With the method and apparatus provided by the embodiments of this specification, a deep neural network DNN with a memory function is designed, and bottleneck features with a memory effect are extracted from the bottleneck layer of such a network and included in the acoustic features. Such acoustic features better reflect the speaker's timing-related prosodic characteristics. The identity vector (i-vector) extracted from such acoustic features can better characterize the speaker's vocal traits, especially prosodic features, so that the accuracy of speaker recognition is improved.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;
FIG. 2 shows a flowchart of a voiceprint recognition method according to one embodiment;
FIG. 3 shows a schematic diagram of a bottleneck layer in a deep neural network;
FIG. 4 shows a schematic structural diagram of a memory DNN according to one embodiment;
FIG. 5 shows a schematic structural diagram of a memory DNN according to another embodiment;
FIG. 6 shows a comparison of LSTM and LSTMP;
FIG. 7 shows a schematic structural diagram of a memory DNN according to another embodiment;
FIG. 8 shows a schematic structural diagram of a memory DNN according to one embodiment;
FIG. 9 shows a schematic structural diagram of a memory DNN according to another embodiment;
FIG. 10 shows a schematic block diagram of a voiceprint recognition apparatus according to one embodiment.
DETAILED DESCRIPTION
The solutions provided in this specification are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. First, the speaker talks, forming the speaker audio. The speaker audio is input to a spectrum extraction unit, which extracts the basic spectral features from it. The speaker audio is also input to a deep neural network (DNN). In the embodiment of FIG. 1, this DNN is a neural network with a memory function and has a bottleneck layer; correspondingly, the features of the bottleneck layer also carry a memory effect. Bottleneck features are extracted from the bottleneck layer of this memory-enabled DNN and combined with the basic spectral features to form the acoustic features. The acoustic features are then input into the identity vector (i-vector) model, in which acoustic statistics are computed from the acoustic features and the i-vector is extracted based on those statistics, followed by speaker recognition. In this way, the result of the voiceprint recognition is output.
FIG. 2 shows a flowchart of a voiceprint recognition method according to one embodiment. The method flow can be executed by any apparatus, device or system with computing and processing capabilities. As shown in FIG. 2, the voiceprint recognition method of this embodiment includes the following steps. Step 21: extract spectral features from the speaker audio. Step 22: input the speaker audio into a memory deep neural network DNN and extract bottleneck features from the bottleneck layer of the memory DNN, where the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other layers in the memory DNN. Step 23: form the acoustic features of the speaker audio based on the basic spectral features and the bottleneck features. Step 24: extract the identity vector corresponding to the speaker audio based on the acoustic features. Step 25: perform speaker recognition based on the identity vector. The specific execution of each of these steps is described below.
First, in step 21, spectral features are extracted from the speaker audio. It can be understood that the speaker audio is audio formed by the speaker talking and can be divided into a plurality of speech segments. The spectral features extracted in step 21 are basic spectral features, in particular (single-frame) short-time spectral features.
In one embodiment, the spectral feature is the Mel-frequency cepstral coefficient (MFCC) feature. The Mel frequency scale is based on the auditory characteristics of the human ear and is nonlinearly related to frequency in hertz (Hz). Extracting MFCC features from speaker audio generally includes the following steps: pre-emphasis, framing, windowing, Fourier transform, Mel filter bank, and discrete cosine transform (DCT). Pre-emphasis boosts the high-frequency part to a certain extent so that the spectrum of the signal becomes flatter; framing divides the speech into a series of frames by time; windowing applies a window function to improve the continuity at the left and right ends of each frame. The audio is then Fourier transformed to convert the time-domain signal into a frequency-domain signal, and the Mel filter bank maps the frequencies of the frequency-domain signal onto the Mel scale, yielding the Mel spectrum. The cepstral coefficients of the Mel spectrum are then obtained by the discrete cosine transform. Further, dynamic difference parameters can be extracted from the standard MFCC cepstral parameters to obtain differential features that reflect the dynamic variation between frames. Therefore, in general, first-order and second-order differential features are also obtained on the basis of the extracted MFCC features. For example, if the Mel-cepstral feature is 20-dimensional, then a 20-dimensional first-order difference feature and a 20-dimensional second-order difference feature are also obtained in the differential-parameter extraction phase, forming a 60-dimensional vector.
In another embodiment, the basic spectral features extracted in step 21 include linear predictive coding (LPC) features or perceptual linear predictive (PLP) features. These features can be extracted by conventional methods. Other short-time spectral features may also be extracted as basic features.
However, basic short-time spectral features are often insufficient to express comprehensive speaker information. For example, MFCC features do not reflect speaker information in the high-frequency domain well. Conventional techniques have therefore introduced the bottleneck feature of a deep neural network DNN as a complement to the overall acoustic features.
A conventional deep neural network DNN is an extension of the traditional feedforward artificial neural network, with more hidden layers and stronger expressive power; in recent years it has been applied in the field of speech recognition. In speech recognition, the deep neural network DNN replaces the GMM part of the Gaussian mixture model-hidden Markov model (GMM-HMM) acoustic model to characterize the emission probabilities of the different states of the HMM. In general, a DNN for speech recognition takes as input acoustic features spliced from multiple preceding and following frames, and its output layer usually uses a softmax function to characterize the posterior probabilities of the HMM phoneme states, thereby classifying the phoneme states.
The deep neural network DNN has this classification ability because, through supervised data, it learns a feature representation that benefits the specific classification task. In a DNN that contains a bottleneck layer, the bottleneck-layer feature is a good embodiment of such a feature representation. Specifically, the bottleneck layer is a hidden layer in the DNN model whose number of nodes, or dimension, is significantly reduced compared with the other hidden layers. In other words, the bottleneck layer contains fewer nodes than any other layer in the DNN. For example, in a DNN where the other hidden layers each have 1024 nodes while one layer has only 64, forming a hidden-layer topology of 1024-1024-64-1024-1024, the middle 64-node hidden layer is called the bottleneck layer. FIG. 3 shows a schematic diagram of a bottleneck layer in a deep neural network: the network includes a plurality of hidden layers, and the hidden layer whose number of nodes is significantly reduced compared with the others is the bottleneck layer.
The activation values of the nodes in the bottleneck layer can be regarded as a low-dimensional representation of the input signal, also called the bottleneck features. In a DNN trained for speech recognition, the bottleneck features can contain more of the speaker's speech information.
In one embodiment, to better reflect the contextual correlation of the sequence of acoustic feature frames and thus better capture the temporal variation of speaking prosody in the speaker audio, the conventional deep neural network DNN is improved by introducing a memory function, forming a deep neural network DNN with memory, i.e., a memory DNN. Specifically, the memory DNN is designed to include at least one temporal recursive layer and a bottleneck layer, with the output of the temporal recursive layer connected to the bottleneck layer, so that the features of the bottleneck layer can reflect temporal characteristics and thus have a "memory" function. Accordingly, in step 22, the speaker audio is input into the memory deep neural network DNN, and memory bottleneck features are extracted from the bottleneck layer of the memory DNN.
In one embodiment, the temporal recursive layer employs a hidden layer of a recurrent neural network (RNN). More specifically, in one embodiment, the temporal recursive layer uses the long short-term memory (LSTM) model.
The recurrent neural network RNN is a temporal recursive neural network that can be used to process sequence data. In an RNN, the current output of a sequence is associated with the outputs that precede it. Specifically, the RNN memorizes earlier information and applies it to the computation of the current output: the nodes between hidden layers are connected, and the input of the hidden layer includes not only the output of the input layer but also the hidden layer's own output at the previous time step. That is, the hidden-layer state at step $t$ can be expressed as:
$$S_t = f(U \cdot X_t + W \cdot S_{t-1})$$
where $X_t$ is the state of the input layer at step $t$, $S_{t-1}$ is the hidden-layer state at step $t-1$, $f$ is the computation function, and $W$, $U$ are weights. In this way, the RNN cycles earlier states back into the current input, thereby accounting for the temporal effects of the input sequence.
When long-term memory must be handled, RNNs suffer from the long-term dependency problem and are difficult to train, for example because gradients easily explode or vanish. The LSTM model, proposed on the basis of the RNN, further addresses this long-term dependency problem.
According to the LSTM model, three gate computations are carried out in the repeating network module: the input gate, the output gate, and the forget gate. The forget gate lets information pass selectively, discarding information that is no longer needed; unnecessary interference in the input can thus be judged and shielded, allowing long-term data to be analyzed and processed better.
FIG. 4 shows a schematic structural diagram of a memory DNN according to one embodiment. As shown in FIG. 4, the deep neural network DNN includes an input layer, an output layer, and several hidden layers, which include a temporal recursive layer formed by the LSTM model. The output of the LSTM layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers. The bottleneck layer has a significantly reduced dimension, for example 64 or 128; the other hidden layers have dimensions of, for example, 1024 or 1500, all significantly higher than the bottleneck layer. The dimension of the LSTM layer may be the same as or different from that of the other conventional hidden layers, but is likewise significantly higher than the dimension of the bottleneck layer. In a typical example, the conventional hidden layers have dimension 1024, the LSTM layer 800, and the bottleneck layer 64, forming a deep neural network DNN with a dimensional topology of 1024-1024-800-64-1024-1024.
FIG. 5 shows a schematic structural diagram of a memory DNN according to another embodiment. As shown in FIG. 5, the deep neural network DNN includes an input layer, an output layer, and several hidden layers, which include an LSTMP (LSTM projected) layer as the temporal recursive layer. The LSTMP layer is an LSTM architecture with a recurrent projection layer. FIG. 6 shows a comparison of LSTM and LSTMP. In the conventional LSTM architecture, the recurrent connection of the LSTM layer is implemented by the LSTM itself, that is, directly from the output units to the input units. Under the LSTMP architecture, a separate linear projection layer is added after the LSTM layer, and the recurrent connection runs from this projection layer back to the input of the LSTM layer. By setting the number of units in the recurrent projection layer, the number of nodes of the LSTM layer can be projected down to a lower dimension.
The output of the recurrent projection layer in the LSTMP layer is connected to the bottleneck layer, which is followed by conventional hidden layers. Similarly, the bottleneck layer has a significantly reduced dimension, while the other hidden layers, including the LSTMP layer, have significantly higher dimensions. In a typical example, the conventional hidden layers have dimension 1024, the LSTM layer 800, and the recurrent projection layer projects down to 512, forming a deep neural network DNN with a dimensional topology of 1024-800-512-64-1024-1024.
Although in the examples of FIG. 4 and FIG. 5 the input layer is directly connected to the LSTM/LSTMP temporal recursive layer, it will be understood that other conventional hidden layers may also be included before the temporal recursive layer.
FIG. 7 shows a schematic structural diagram of a memory DNN according to another embodiment. As shown in FIG. 7, the deep neural network DNN includes an input layer, an output layer, and several hidden layers, which include two LSTM layers; the output of the latter LSTM layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers. Similarly, the bottleneck layer has a significantly reduced dimension, while the other hidden layers, including the two LSTM layers, have significantly higher dimensions. In a typical example, the conventional hidden layers have dimension 1024, the two LSTM layers 800 each, and the bottleneck layer 64, forming a deep neural network DNN with a dimensional topology of 1024-800-800-64-1024-1024.
In one embodiment, the LSTM layers in FIG. 7 can be replaced with LSTMP layers. In another embodiment, more LSTM layers can be included in the DNN.
In one embodiment, a temporal recursive layer is formed in the deep neural network DNN using the feedforward sequential memory network (FSMN) model. The FSMN model can be regarded as a standard feedforward fully connected neural network with some learnable memory modules placed in its hidden layers; these memory modules use a tapped-delay-line structure to encode long-term context information into a fixed-size representation, serving as a short-term memory mechanism. The FSMN therefore models the long-term dependency in the time signal without requiring feedback connections. For speech recognition, FSMN performs well, and its training process is simpler and more efficient.
On the basis of FSMN, a compact version, cFSMN (compact FSMN), has also been proposed, with a more simplified model structure. In the cFSMN, the input data first passes through a projection layer for dimensionality reduction (for example, down to 512 dimensions), then undergoes memory processing by the memory module, and finally the memory-processed feature data (for example, 1024-dimensional) is output.
The DNN can be given a memory function by introducing the FSMN model, or the cFSMN model, into the deep neural network DNN.
FIG. 8 shows a schematic structural diagram of a memory DNN according to one embodiment. As shown in FIG. 8, the deep neural network DNN includes an input layer, an output layer, and several hidden layers, which include a temporal recursive layer formed by the FSMN model. The output of the FSMN layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers. The bottleneck layer has a significantly reduced dimension; the other hidden layers, including the FSMN layer, have significantly higher dimensions, such as 1024 or 1500. In a typical example, the FSMN layer dimension is 1024, the other hidden-layer dimensions are 2048, and the bottleneck-layer dimension is 64, forming a deep neural network DNN with a dimensional topology of 2048-1024-64-2048-2048.
FIG. 9 shows a schematic structural diagram of a memory DNN according to another embodiment. As shown in FIG. 9, the deep neural network DNN includes an input layer, an output layer, and several hidden layers, which include two cFSMN layers; the output of the latter cFSMN layer is connected to the bottleneck layer, and the bottleneck layer is followed by conventional hidden layers. Similarly, the bottleneck layer has a significantly reduced dimension, while the other hidden layers, including the two cFSMN layers, have significantly higher dimensions. In a typical example, the conventional hidden layers have dimension 2048, the two cFSMN layers 1024 each, and the bottleneck layer 64, forming a deep neural network DNN with a dimensional topology of 2048-1024-1024-64-2048-2048.
It can be understood that the cFSMN layers of FIG. 9 can also be replaced with FSMN layers. In another embodiment, more FSMN/cFSMN layers can be included in the DNN.
In one embodiment, other temporal recursive models may also be employed in the deep neural network DNN to form a DNN with a memory function. In general, the memory-enabled DNN includes one or more temporal recursive layers, and the temporal recursive layer is directly connected to the bottleneck layer, so that the features of the bottleneck layer can reflect temporal effects and carry a memory effect. It can be understood that more temporal recursive layers give better performance but higher network complexity, while fewer temporal recursive layers make training of the network model simpler. Typically, a DNN with between 1 and 5 temporal recursive layers is employed.
As can be seen from the above description, the deep neural network DNN can be designed to have a temporal recursive layer before the bottleneck layer. Such a memory-enabled deep neural network DNN can be trained for speech recognition by conventional methods. The bottleneck features of the trained DNN can reflect richer speech information than the basic spectral features, and because the DNN has a temporal recursive layer before the bottleneck layer, the bottleneck features also carry a memory function, reflecting the temporal effects of the speech. Accordingly, in step 22 of FIG. 2, the bottleneck-layer features of the memory-enabled deep neural network DNN described above are extracted, thereby obtaining bottleneck features with a memory function.
Specifically, in step 22, the speaker audio is input into the memory deep neural network DNN described above. In one embodiment, consecutive multi-frame speech from the speaker audio is input into the DNN; the consecutive frames include, for example, 10 frames looking back and 5 frames looking ahead, plus the current frame, for a total of 16 consecutive frames. For these consecutive frames, it is common to input their basic spectral features into the memory DNN. In one embodiment, the basic spectral feature input to the DNN is the Mel-frequency cepstral coefficient (MFCC) feature. In another embodiment, the basic spectral feature input to the DNN is the Mel-scale filter bank (FBank) feature. The FBank feature is the spectral feature obtained by mapping the frequencies of the frequency-domain signal onto the Mel scale with the Mel filter bank; in other words, the MFCC feature applies a further discrete cosine transform on top of the FBank feature, and the FBank feature is the MFCC feature before the discrete cosine transform.
The basic spectral features input to the memory DNN are processed by the DNN's computations and form a series of bottleneck features at the bottleneck layer. Specifically, the bottleneck layer contains a small number of low-dimensional nodes that are assigned activation values during the forward computation in which the DNN processes the spectral features. The bottleneck features are extracted by reading the activation values of the bottleneck-layer nodes.
Thus, in step 21 the basic spectral features are extracted from the speaker audio, and in step 22 the bottleneck features are extracted from the memory-enabled deep neural network DNN. On this basis, in step 23, the spectral features and the bottleneck features are combined to form the acoustic features of the speaker audio. In one embodiment, the bottleneck features are spliced with the basic spectral features to form the acoustic features of the speaker audio.
For example, suppose the basic spectral features include a 20-dimensional MFCC feature together with 20-dimensional first-order and second-order MFCC difference features, and the bottleneck feature has the same dimension as the bottleneck layer, for example 64. Then the 60-dimensional MFCC-plus-difference features can be spliced with the 64-dimensional bottleneck feature, giving a 124-dimensional vector as the acoustic feature O_t. Of course, in other examples, the acoustic feature O_t may also contain more features obtained from other factors.
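A one-line numpy sketch of this splicing step, with random arrays standing in for real per-frame features:

```python
import numpy as np

mfcc_feats = np.random.randn(100, 60)   # (frames, 60): MFCC plus first/second differences
bn_feats = np.random.randn(100, 64)     # (frames, 64): memory bottleneck features
acoustic = np.concatenate([mfcc_feats, bn_feats], axis=1)   # (frames, 124) features O_t
```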
Next, in step 24, the identity vector, i.e., the i-vector, corresponding to the speaker audio is extracted based on the acoustic features obtained above.
The i-vector model is built on the Gaussian mean supervector space represented by the Gaussian mixture model-universal background model (GMM-UBM); it assumes that speaker information and channel information are simultaneously contained in a low-dimensional subspace. Given an utterance, its Gaussian mean supervector $M$ can be decomposed as:
$$M = m + T\omega$$
where $m$ is the speaker- and channel-independent component, which can usually be replaced by the mean supervector of the UBM; $T$ is the total variability subspace matrix; and $\omega$ is the variability factor containing the speaker and channel information, i.e., the i-vector.
To compute and extract the i-vector, the sufficient statistics (i.e., the Baum-Welch statistics) of each speech segment must be computed:
$$N_c^{(k)} = \sum_t \gamma_t^{(k)}(c), \qquad F_c^{(k)} = \sum_t \gamma_t^{(k)}(c)\,\big(o_t^{(k)} - \mu_c\big), \qquad S_c^{(k)} = \sum_t \gamma_t^{(k)}(c)\,\big(o_t^{(k)} - \mu_c\big)\big(o_t^{(k)} - \mu_c\big)^{\top}$$
which denote, respectively, the zero-order, first-order and second-order statistics of speech segment $k$ on the $c$-th GMM mixture component. Here $o_t^{(k)}$ denotes the acoustic feature of speech segment $k$ at time index $t$, $\mu_c$ is the mean of the $c$-th GMM mixture component, and $\gamma_t^{(k)}(c)$ denotes the posterior probability of acoustic feature $o_t^{(k)}$ on the $c$-th GMM mixture component. The i-vector is mapped and extracted based on these sufficient statistics.
It can be seen that i-vector extraction rests on the computation of these sufficient statistics, and that computation is entirely based on the acoustic features $o_t^{(k)}$. According to steps 21-23 of FIG. 2, the acoustic features $o_t^{(k)}$ contain not only the basic spectral information of the speaker audio but also the bottleneck features of the memory-enabled deep neural network DNN. Such acoustic features can better characterize the prosodic information of the speech segment, and accordingly the i-vector extracted from them characterizes the speaker's vocal traits more comprehensively.
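A numpy sketch of these sufficient statistics, assuming posteriors gamma of shape (T, C), acoustic features O of shape (T, d) and UBM component means mu of shape (C, d); the centered form of the first-order statistic is the standard convention assumed here.

```python
import numpy as np

def baum_welch_stats(O, gamma, mu):
    N = gamma.sum(axis=0)                              # zero-order statistics, shape (C,)
    centered = O[:, None, :] - mu[None, :, :]          # (T, C, d) centered features
    F = np.einsum("tc,tcd->cd", gamma, centered)       # first-order statistics, (C, d)
    S = np.einsum("tc,tcd,tce->cde", gamma, centered, centered)  # second-order, (C, d, d)
    return N, F, S
```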
Next, in step 25, speaker recognition is performed based on the identity vector (i-vector) extracted above. Specifically, the extracted i-vector can be input, as an identity feature, into a classifier model for classification and speaker recognition. The classifier model is, for example, a probabilistic linear discriminant analysis (PLDA) model, which computes a likelihood-ratio score between different i-vectors and makes a decision based on that score. In another example, the classifier model is a support vector machine (SVM) model, a supervised classification algorithm that finds a separating hyperplane dividing the data on its two sides, thereby classifying the i-vectors.
As can be seen from the above, because the acoustic features contain the memory bottleneck features, they better characterize the prosodic information of the speech segments; accordingly, the i-vector extracted from these acoustic features characterizes the speaker's vocal traits more comprehensively, and speaker recognition based on such an i-vector achieves higher recognition accuracy.
According to another aspect, the embodiments of this specification further provide an apparatus for voiceprint recognition. FIG. 10 shows a schematic block diagram of a voiceprint recognition apparatus according to one embodiment. As shown in FIG. 10, the apparatus 100 includes: a first extraction unit 110 configured to extract a first spectral feature from speaker audio; a second extraction unit 120 configured to input the speaker audio into a memory deep neural network DNN and extract a bottleneck feature from the bottleneck layer of the memory DNN, wherein the memory DNN includes at least one temporal recursive layer and the bottleneck layer, the output of the at least one temporal recursive layer is connected to the bottleneck layer, and the dimension of the bottleneck layer is smaller than the dimensions of the other hidden layers in the memory DNN; a feature combination unit 130 configured to form an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature; a vector extraction unit 140 configured to extract an identity vector corresponding to the speaker audio based on the acoustic feature; and a classification recognition unit 150 configured to perform speaker recognition with a classification model based on the identity vector.
According to one embodiment, the first spectral feature extracted by the first extraction unit 110 includes the Mel-frequency cepstral coefficient (MFCC) feature and the first-order and second-order difference features of the MFCC feature.
In one embodiment, the temporal recursive layer in the memory deep neural network DNN on which the second extraction unit 120 relies includes a hidden layer based on the long short-term memory (LSTM) model, or a hidden layer based on the LSTMP model, where the LSTMP model is an LSTM model with a recurrent projection layer.
In another embodiment, the temporal recursive layer may also include a hidden layer based on the feedforward sequential memory network (FSMN) model, or a hidden layer based on the cFSMN model, where the cFSMN model is a compact FSMN model.
In one embodiment, the second extraction unit 120 is configured to extract a second spectral feature from consecutive multi-frame speech of the speaker audio and input the second spectral feature into the deep neural network DNN.
Further, in one example, the second spectral feature is the Mel-scale filter bank (FBank) feature.
In one embodiment, the feature combination unit 130 is configured to splice the first spectral feature and the bottleneck feature to form the acoustic feature.
With the method and apparatus described above, a deep neural network DNN with a memory function is designed, and bottleneck features with a memory effect are extracted from the bottleneck layer of such a network and included in the acoustic features. Such acoustic features better reflect the speaker's timing-related prosodic characteristics. The identity vector (i-vector) extracted from such acoustic features can better characterize the speaker's vocal traits, especially prosodic features, so that the accuracy of speaker recognition is improved.
According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in connection with FIG. 2.
According to an embodiment of yet another aspect, a computing device is further provided, comprising a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method described in connection with FIG. 2 is implemented.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the present invention shall be included within its scope of protection.

Claims (16)

  1. A method for voiceprint recognition, comprising:
    extracting a first spectral feature from speaker audio;
    inputting the speaker audio into a memory deep neural network DNN, and extracting a bottleneck feature from a bottleneck layer of the memory deep neural network, wherein the memory deep neural network DNN comprises at least one temporal recursive layer and the bottleneck layer, an output of the at least one temporal recursive layer is connected to the bottleneck layer, and a dimension of the bottleneck layer is smaller than dimensions of other hidden layers in the memory deep neural network DNN;
    forming an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature;
    extracting an identity vector corresponding to the speaker audio based on the acoustic feature; and
    performing speaker recognition with a classification model based on the identity vector.
  2. The method according to claim 1, wherein the first spectral feature comprises a Mel-frequency cepstral coefficient MFCC feature, and first-order and second-order difference features of the MFCC feature.
  3. The method according to claim 1, wherein the at least one temporal recursive layer comprises a hidden layer based on a long short-term memory LSTM model, or a hidden layer based on an LSTMP model, the LSTMP model being an LSTM model with a recurrent projection layer.
  4. The method according to claim 1, wherein the at least one temporal recursive layer comprises a hidden layer based on a feedforward sequential memory network FSMN model, or a hidden layer based on a cFSMN model, the cFSMN model being a compact FSMN model.
  5. The method according to claim 1, wherein inputting the speaker audio into the memory deep neural network DNN comprises: extracting a second spectral feature from consecutive multi-frame speech of the speaker audio, and inputting the second spectral feature into the memory deep neural network DNN.
  6. The method according to claim 5, wherein the second spectral feature is a Mel-scale filter bank FBank feature.
  7. The method according to claim 1, wherein forming the acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature comprises splicing the first spectral feature and the bottleneck feature to form the acoustic feature.
  8. An apparatus for voiceprint recognition, comprising:
    a first extraction unit configured to extract a first spectral feature from speaker audio;
    a second extraction unit configured to input the speaker audio into a memory deep neural network DNN and extract a bottleneck feature from a bottleneck layer of the memory deep neural network, wherein the memory deep neural network DNN comprises at least one temporal recursive layer and the bottleneck layer, an output of the at least one temporal recursive layer is connected to the bottleneck layer, and a dimension of the bottleneck layer is smaller than dimensions of other hidden layers in the memory deep neural network DNN;
    a feature combination unit configured to form an acoustic feature of the speaker audio based on the first spectral feature and the bottleneck feature;
    a vector extraction unit configured to extract an identity vector corresponding to the speaker audio based on the acoustic feature; and
    a classification recognition unit configured to perform speaker recognition with a classification model based on the identity vector.
  9. The apparatus according to claim 8, wherein the first spectral feature comprises a Mel-frequency cepstral coefficient MFCC feature, and first-order and second-order difference features of the MFCC feature.
  10. The apparatus according to claim 8, wherein the at least one temporal recursive layer comprises a hidden layer based on a long short-term memory LSTM model, or a hidden layer based on an LSTMP model, the LSTMP model being an LSTM model with a recurrent projection layer.
  11. The apparatus according to claim 8, wherein the at least one temporal recursive layer comprises a hidden layer based on a feedforward sequential memory network FSMN model, or a hidden layer based on a cFSMN model, the cFSMN model being a compact FSMN model.
  12. The apparatus according to claim 8, wherein the second extraction unit is configured to extract a second spectral feature from consecutive multi-frame speech of the speaker audio and input the second spectral feature into the memory deep neural network DNN.
  13. The apparatus according to claim 12, wherein the second spectral feature is a Mel-scale filter bank FBank feature.
  14. The apparatus according to claim 8, wherein the feature combination unit is configured to splice the first spectral feature and the bottleneck feature to form the acoustic feature.
  15. A computer-readable storage medium on which a computer program is stored, the computer program, when executed in a computer, causing the computer to perform the method of any one of claims 1-7.
  16. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-7.
PCT/CN2019/073101 2018-02-12 2019-01-25 Method and apparatus for voiceprint recognition based on memory bottleneck features WO2019154107A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP19750725.4A EP3719798B1 (en) 2018-02-12 2019-01-25 Voiceprint recognition method and device based on memorability bottleneck feature
SG11202006090RA SG11202006090RA (en) 2018-02-12 2019-01-25 Voiceprint Recognition Method And Device Based On Memory Bottleneck Feature
EP21198758.1A EP3955246B1 (en) 2018-02-12 2019-01-25 Voiceprint recognition method and device based on memory bottleneck feature
US16/905,354 US20200321008A1 (en) 2018-02-12 2020-06-18 Voiceprint recognition method and device based on memory bottleneck feature

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810146310.2A 2018-02-12 2018-02-12 Method and apparatus for voiceprint recognition based on memory bottleneck features
CN201810146310.2 2018-02-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/905,354 Continuation US20200321008A1 (en) 2018-02-12 2020-06-18 Voiceprint recognition method and device based on memory bottleneck feature

Publications (1)

Publication Number Publication Date
WO2019154107A1 (zh) 2019-08-15

Family

ID=63192672

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073101 WO2019154107A1 (zh) 2019-01-25 Method and apparatus for voiceprint recognition based on memory bottleneck features

Country Status (6)

Country Link
US (1) US20200321008A1 (zh)
EP (2) EP3719798B1 (zh)
CN (1) CN108447490B (zh)
SG (1) SG11202006090RA (zh)
TW (1) TW201935464A (zh)
WO (1) WO2019154107A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951256A (zh) * 2021-01-25 2021-06-11 北京达佳互联信息技术有限公司 Speech processing method and apparatus
US11899765B2 (en) 2019-12-23 2024-02-13 Dts Inc. Dual-factor identification system and method with adaptive enrollment

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447490B (zh) * 2018-02-12 2020-08-18 阿里巴巴集团控股有限公司 Method and apparatus for voiceprint recognition based on memory bottleneck features
KR102637339B1 (ko) * 2018-08-31 2024-02-16 삼성전자주식회사 Method and apparatus for personalizing a speech recognition model
CN109036467B (zh) * 2018-10-26 2021-04-16 南京邮电大学 TF-LSTM-based CFFD extraction method, speech emotion recognition method and system
JP7024691B2 (ja) * 2018-11-13 2022-02-24 日本電信電話株式会社 Non-verbal utterance detection apparatus, non-verbal utterance detection method, and program
US11315550B2 (en) * 2018-11-19 2022-04-26 Panasonic Intellectual Property Corporation Of America Speaker recognition device, speaker recognition method, and recording medium
CN109360553B (zh) * 2018-11-20 2023-06-20 华南理工大学 Time-delay recurrent neural network for speech recognition
CN109754812A (zh) * 2019-01-30 2019-05-14 华南理工大学 Voiceprint authentication method with recording-replay attack detection based on convolutional neural networks
WO2020199013A1 (en) * 2019-03-29 2020-10-08 Microsoft Technology Licensing, Llc Speaker diarization with early-stop clustering
KR102294638B1 (ko) * 2019-04-01 2021-08-27 한양대학교 산학협력단 Joint learning method and apparatus using deep-neural-network-based feature enhancement and a modified loss function for speaker recognition robust to noisy environments
CN112333545B (zh) * 2019-07-31 2022-03-22 Tcl科技集团股份有限公司 Television content recommendation method, system, storage medium, and smart television
CN110379412B (zh) 2019-09-05 2022-06-17 腾讯科技(深圳)有限公司 Speech processing method and apparatus, electronic device, and computer-readable storage medium
CN111028847B (zh) * 2019-12-17 2022-09-09 广东电网有限责任公司 Voiceprint recognition optimization method based on a back-end model, and related apparatus
CN111354364B (zh) * 2020-04-23 2023-05-02 上海依图网络科技有限公司 Voiceprint recognition method and system based on RNN aggregation
CN111653270B (zh) * 2020-08-05 2020-11-20 腾讯科技(深圳)有限公司 Speech processing method and apparatus, computer-readable storage medium, and electronic device
CN112241467A (zh) * 2020-12-18 2021-01-19 北京爱数智慧科技有限公司 Audio duplicate-checking method and apparatus
TWM619473U (zh) * 2021-01-13 2021-11-11 神盾股份有限公司 Voice assistant system
CN112992126B (zh) * 2021-04-22 2022-02-25 北京远鉴信息技术有限公司 Method and apparatus for verifying voice authenticity, electronic device, and readable storage medium
CN113284508B (zh) * 2021-07-21 2021-11-09 中国科学院自动化研究所 Generated-audio detection system based on hierarchical discrimination
CN114333900B (zh) * 2021-11-30 2023-09-05 南京硅基智能科技有限公司 Method for end-to-end extraction of BNF features, network model, training method and system
CN114882906A (zh) * 2022-06-30 2022-08-09 广州伏羲智能科技有限公司 Novel environmental noise recognition method and system
CN116072123B (zh) * 2023-03-06 2023-06-23 南昌航天广信科技有限责任公司 Broadcast information playing method and apparatus, readable storage medium, and electronic device
CN117238320B (zh) * 2023-11-16 2024-01-09 天津大学 Noise classification method based on a multi-feature fusion convolutional neural network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971690A (zh) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and apparatus
CN105575394A (zh) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint recognition method based on global variability space and deep learning hybrid modeling
CN106448684A (zh) * 2016-11-16 2017-02-22 北京大学深圳研究生院 Channel-robust voiceprint recognition system based on deep belief network feature vectors
CN106875942A (zh) * 2016-12-28 2017-06-20 中国科学院自动化研究所 Acoustic model adaptation method based on accent bottleneck features
CN106952644A (zh) * 2017-02-24 2017-07-14 华南理工大学 Complex audio segmentation and clustering method based on bottleneck features
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
CN107610707A (zh) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 Voiceprint recognition method and apparatus
CN108447490A (zh) * 2018-02-12 2018-08-24 阿里巴巴集团控股有限公司 Method and apparatus for voiceprint recognition based on memory bottleneck features

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9324320B1 (en) * 2014-10-02 2016-04-26 Microsoft Technology Licensing, Llc Neural network-based speech processing
CN107492382B (zh) * 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 Neural network-based voiceprint information extraction method and apparatus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3719798A4 *


Also Published As

Publication number Publication date
CN108447490A (zh) 2018-08-24
EP3719798A1 (en) 2020-10-07
EP3719798A4 (en) 2021-03-24
EP3955246A1 (en) 2022-02-16
CN108447490B (zh) 2020-08-18
EP3955246B1 (en) 2023-03-29
EP3719798B1 (en) 2022-09-21
US20200321008A1 (en) 2020-10-08
TW201935464A (zh) 2019-09-01
SG11202006090RA (en) 2020-07-29

Similar Documents

Publication Publication Date Title
WO2019154107A1 (zh) Method and apparatus for voiceprint recognition based on memory bottleneck features
Kabir et al. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities
US10699699B2 (en) Constructing speech decoding network for numeric speech recognition
US10923111B1 (en) Speech detection and speech recognition
WO2021139425A1 (zh) 语音端点检测方法、装置、设备及存储介质
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
Ahmad et al. A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network
US7957959B2 (en) Method and apparatus for processing speech data with classification models
US7693713B2 (en) Speech models generated using competitive training, asymmetric training, and data boosting
EP3469582A1 (en) Neural network-based voiceprint information extraction method and apparatus
Nayana et al. Comparison of text independent speaker identification systems using GMM and i-vector methods
US8447614B2 (en) Method and system to authenticate a user and/or generate cryptographic data
CN112102815A (zh) 语音识别方法、装置、计算机设备和存储介质
Todkar et al. Speaker recognition techniques: A review
Liao et al. Incorporating symbolic sequential modeling for speech enhancement
Beigi Speaker recognition: Advancements and challenges
Mohammed et al. Advantages and disadvantages of automatic speaker recognition systems
Rozario et al. Performance comparison of multiple speech features for speaker recognition using artifical neural network
Sas et al. Gender recognition using neural networks and ASR techniques
Musaev et al. Advanced feature extraction method for speaker identification using a classification algorithm
Lavania et al. Reviewing Human-Machine Interaction through Speech Recognition approaches and Analyzing an approach for Designing an Efficient System
Nair et al. A reliable speaker verification system based on LPCC and DTW
Bhukya et al. Automatic speaker verification spoof detection and countermeasures using gaussian mixture model
Zhao et al. Research on x-vector speaker recognition algorithm based on Kaldi
JP2000259198A (ja) Pattern recognition apparatus and method, and providing medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19750725; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2019750725; Country of ref document: EP; Effective date: 20200629)
NENP Non-entry into the national phase (Ref country code: DE)