WO2020177380A1 - Voiceprint detection method, apparatus and device based on short text, and storage medium

Voiceprint detection method, apparatus and device based on short text, and storage medium

Info

Publication number
WO2020177380A1
WO2020177380A1 (PCT/CN2019/117731)
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
voice signal
neural network
vector
deep neural
Prior art date
Application number
PCT/CN2019/117731
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
周新宇
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020177380A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of information technology, and in particular to a short text-based voiceprint detection method, device, equipment and storage medium.
  • Voiceprint detection is a common and effective identity recognition method that can be applied to a series of scenarios requiring identity authentication, such as online payment, voiceprint lock control, survival authentication, and Internet of Things device verification. It is especially useful in remote verification where video image verification is inconvenient, since it is not restricted by the device at all.
  • the embodiments of the present application provide a short text-based voiceprint detection method, device, equipment, and storage medium to solve the problems of lengthy voice signals, large amounts of sample information, and high computing resource requirements in existing voiceprint detection methods.
  • a voiceprint detection method based on short text including:
  • the Mel frequency cepstrum coefficients are passed as input into a pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the training samples and speech signals are both short texts.
  • the acquiring training samples and using the training samples to train a preset deep neural network includes:
  • the Mel frequency cepstrum coefficients with user labels are used as input vectors to the preset deep neural network for training;
  • the Mel frequency cepstral coefficients with user labels are used as input vectors and passed into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each Mel frequency cepstral coefficient reaches the specified threshold, at which point iteration stops.
  • the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
  • the comparing the voiceprint vector of the voice signal with a pre-stored voiceprint vector in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result includes:
  • the preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient includes:
  • the discrete cosine transform is performed on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the speech signal.
  • a voiceprint detection device based on short text including:
  • the training module is used to obtain training samples, and use the training samples to train a preset deep neural network
  • the signal acquisition module is used to acquire the voice signal to be recognized
  • the feature extraction module is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
  • the feature acquisition module is used to pass the Mel frequency cepstrum coefficients as input into a pre-trained deep neural network and to acquire the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the detection module is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
  • the training samples and speech signals are both short texts.
  • the detection module includes:
  • the comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library
  • the first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
  • the second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
  • the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
  • the Mel frequency cepstrum coefficients are passed as input into a pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the training samples and speech signals are both short texts.
  • one or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the Mel frequency cepstrum coefficients are passed as input into a pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the training samples and speech signals are both short texts.
  • FIG. 1 is a flowchart of a voiceprint detection method based on short text in an embodiment of the present application;
  • FIG. 2 is a flowchart of step S101 in the voiceprint detection method based on short text in an embodiment of the present application;
  • FIG. 3 is a flowchart of step S103 in the voiceprint detection method based on short text in an embodiment of the present application;
  • FIG. 4 is a flowchart of step S105 in the short text-based voiceprint detection method in an embodiment of the present application;
  • FIG. 5 is a functional block diagram of a voiceprint detection device based on short text in an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a computer device in an embodiment of the present application.
  • the voiceprint detection method based on short text provided by the embodiment of the present application is applied to a server.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for voiceprint detection based on short text is provided, which includes the following steps:
  • step S101 a training sample is obtained, and the training sample is used to train a preset deep neural network.
  • the embodiment of the application redesigned a deep neural network suitable for short text.
  • the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer.
  • Each fully connected layer takes a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
  • the deep neural network is not limited by the model structure, and can use short texts as training samples and input vectors, thereby reducing data requirements.
  • here, a short text is a voice signal of relatively short length, for example a sentence-length voice signal.
  • optionally, the short text may be defined by a specified length, i.e., a voice signal whose length is less than or equal to the specified length.
  • voice samples of multiple users are collected as training samples, and a preset deep neural network is trained based on the training samples.
  • the step S101 includes:
  • step S201 voice samples of multiple users are obtained as training samples.
  • voice samples corresponding to multiple users can be collected in advance in specific application scenarios.
  • voice samples corresponding to each user can be collected through channels such as professional knowledge bases, network databases, etc., as training samples.
  • step S202 the training samples of each user are preprocessed, and feature extraction is performed on the preprocessed training samples to obtain MFCC features.
  • the MFCC feature (Mel-scale Frequency Cepstral Coefficients, MFCC for short) is a discriminative component of the speech signal: a cepstral parameter extracted in the Mel-scale frequency domain. Because it takes into account the human ear's perception of different frequencies, it is especially suitable for speech recognition and speaker recognition.
  • the embodiment of the present application designs a deep neural network based on the MFCC feature, and uses the MFCC feature as the input of the deep neural network. Before training the deep neural network, first perform preprocessing and feature extraction on the user samples to obtain corresponding MFCC features. The preprocessing and feature extraction of the training samples of the user are the same as step S103. For details, please refer to the description of step S103, which will not be repeated here.
  • a set of 128-dimensional MFCC features corresponding to the training sample is obtained by performing feature extraction on the preprocessed training sample.
  • the 128-dimensional MFCC feature is used as the input vector of the deep neural network.
  • step S203 the MFCC feature of each user is tagged with a user tag.
  • the user tag is used to identify the speaker to which the MFCC feature belongs.
  • Different users have different user tags for their corresponding MFCC features.
  • the 128-dimensional MFCC feature of each user needs to be tagged with a corresponding user label.
  • The following example illustrates this. Assume there are three users: user 1, user 2, and user 3. In step S203, user 1's MFCC features are tagged with the user label "01", user 2's MFCC features with the user label "02", and user 3's MFCC features with the user label "03".
  • the user tag may also be a tag of other forms.
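  • As a purely hypothetical illustration of step S203 (none of the names below come from the patent), tagging can be as simple as pairing each user's 128-dimensional MFCC feature vectors with that user's label; the integer index form is convenient for the cross-entropy style training loss mentioned later.

```python
# mfcc_by_user is assumed to map a user label ("01", "02", "03", ...) to a list
# of that user's 128-dimensional MFCC feature vectors.
label_to_index = {label: i for i, label in enumerate(sorted(mfcc_by_user))}
labelled_features = [(mfcc, label_to_index[user_label])
                     for user_label, mfcc_list in mfcc_by_user.items()
                     for mfcc in mfcc_list]
```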
  • step S204 the MFCC feature with the user tag is used as an input vector into a preset deep neural network for training.
  • the 128-dimensional MFCC feature with the same user label is used as an input vector, and then passed into a preset deep neural network for training, and the recognition result of the user is obtained.
  • the preset deep neural network includes an input layer, a four-layer fully connected layer, and an output layer.
  • Each fully connected layer takes a 12-dimensional input and uses the maxout excitation function; the output expression of hidden-layer node i is h_i(x) = max_{j ∈ [1, k]} z_{ij}, with z_{ij} = x^T W_{·ij} + b_{ij}, where:
  • b represents the bias values;
  • W represents the three-dimensional matrix composed of the parameters, with size d × m × k;
  • d represents the number of nodes in the input layer;
  • m represents the number of nodes in the hidden layer;
  • k represents the number of hidden hidden-layer nodes corresponding to each hidden-layer node, and the k hidden hidden-layer nodes all have linear outputs;
  • each node of the maxout excitation function takes the maximum value among the output values of its k hidden hidden-layer nodes.
  • In this embodiment, the number of nodes m in each fully connected layer is 12; for each of the 12 nodes, the maximum of the output values of its k hidden hidden-layer nodes generated by the maxout excitation function is taken, and the maxima corresponding to the 12 nodes are combined into the output vector of the fully connected layer.
  • the embodiment of the present application uses the maxout excitation function to give the fully connected layers of the deep neural network a non-linear transformation.
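  • To make the maxout computation concrete, the following is a minimal Python/PyTorch sketch of one maxout fully connected layer as described above: each of the m output nodes takes the maximum over k linearly computed "hidden hidden" nodes, with the parameters W (d × m × k) and bias b folded into one linear map. PyTorch and the class name Maxout are illustrative choices, not part of the patent.

```python
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: each of the m output nodes is the max over k linear 'hidden hidden' nodes."""
    def __init__(self, d_in, m_out, k):
        super().__init__()
        self.m_out, self.k = m_out, k
        # One linear map produces all m*k candidate activations
        # (the d x m x k parameter tensor W plus the bias b of the text).
        self.linear = nn.Linear(d_in, m_out * k)

    def forward(self, x):
        z = self.linear(x)                       # (batch, m*k) candidate outputs z_ij
        z = z.view(-1, self.m_out, self.k)       # (batch, m, k)
        return z.max(dim=2).values               # max over the k candidates per node
```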
  • the deep neural network includes four fully connected layers, which are respectively denoted as the first fully connected layer, the second fully connected layer, the third fully connected layer, and the fourth fully connected layer.
  • the MFCC features with user labels are first passed through the first fully connected layer; then the output vector of the first fully connected layer is used as the input vector of the second fully connected layer, the output vector of the second fully connected layer is used as the input vector of the third fully connected layer, the output vector of the third fully connected layer is used as the input vector of the fourth fully connected layer, and the output vector of the fourth fully connected layer is used as the input vector of the output layer.
  • the embodiment of the present application adopts a discard strategy, that is, a dropout strategy, when training the third and fourth fully connected layers;
  • the first discard probability and the second discard probability are set according to actual requirements; in the embodiment of the present application, both are preferably 0.5.
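  • Putting the layer description together, here is a hedged sketch of the full network: an input layer, four 12-node maxout fully connected layers (each with a 12-dimensional input), dropout with probability 0.5 applied when training the third and fourth layers, and an output layer over the enrolled users. It reuses the Maxout module sketched above; how the 128-dimensional MFCC input is reduced to 12 dimensions by the input layer, the value of k, and the linear output layer are assumptions, not details fixed by the patent.

```python
import torch.nn as nn

class ShortTextVoiceprintDNN(nn.Module):
    """Sketch of the described short-text voiceprint network."""
    def __init__(self, num_users, k=3):
        super().__init__()
        self.input_layer = nn.Linear(128, 12)    # input layer (assumed linear projection of the 128-dim MFCC)
        self.fc1 = Maxout(12, 12, k)
        self.fc2 = Maxout(12, 12, k)
        self.fc3 = Maxout(12, 12, k)
        self.fc4 = Maxout(12, 12, k)
        self.drop = nn.Dropout(p=0.5)            # discard (dropout) strategy; active only in training mode
        self.output_layer = nn.Linear(12, num_users)

    def forward(self, x, return_dvector=False):
        h = self.input_layer(x)
        h = self.fc1(h)
        h = self.fc2(h)
        h = self.drop(self.fc3(h))               # dropout on the 3rd fully connected layer
        h = self.drop(self.fc4(h))               # dropout on the 4th fully connected layer
        if return_dvector:
            return h                             # d-vector: output of the last fully connected layer
        return self.output_layer(h)              # per-user recognition scores
```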
  • step S205 a preset loss function is used to calculate the error between the recognition result of each MFCC feature through the deep neural network and the corresponding user tag, and the parameters of the deep neural network are modified according to the error .
  • each fully connected layer uses a maxout excitation function, which includes a three-dimensional parameter matrix W and a bias value b.
  • the error between the recognition result of each MFCC feature and the corresponding user tag is calculated using a preset loss function, and the parameter matrix W and the bias value b of the maxout excitation function in the deep neural network are modified by propagating the error back.
  • the loss function includes, but is not limited to, a cross-entropy loss function and a squared loss function.
  • In step S206, the MFCC features with user tags are passed as input vectors into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each MFCC feature reaches the specified threshold, at which point iteration stops.
  • the deep neural network whose parameters have been modified in step S205 is used for the next training, that is, the MFCC features with user tags are used as the input vector and then passed into the modified deep neural network for training.
  • the training process is the same as that in step S204.
  • Steps S204, S205, and S206 are repeated until the accuracy of the deep neural network's recognition results for the MFCC features of all users reaches the specified threshold, that is, until the probability that the deep neural network's recognition result for each MFCC feature matches the corresponding user label reaches the specified threshold. This indicates that every parameter in the deep neural network has been adjusted adequately, the training of the deep neural network is deemed complete, and iteration stops.
  • the trained deep neural network can be used to extract the voiceprint vector from the speech signal.
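  • A minimal sketch of the training loop of steps S204-S206, assuming the network and data layout of the earlier sketches (features: a float tensor of labelled 128-dimensional MFCC vectors; labels: the integer user tags). The optimizer, learning rate, accuracy threshold and epoch cap are illustrative values; the patent only specifies computing the error with a loss function, correcting the parameters, and iterating until the recognition accuracy reaches the threshold.

```python
import torch
import torch.nn as nn

def train_until_threshold(model, features, labels, threshold=0.95, lr=1e-3, max_epochs=200):
    """Repeat S204-S206: forward pass, loss, parameter correction, until accuracy >= threshold."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()        # the cross-entropy loss mentioned in the text
    model.train()
    for epoch in range(max_epochs):
        optimizer.zero_grad()
        logits = model(features)
        loss = criterion(logits, labels)
        loss.backward()                      # propagate the error back to W and b of every maxout layer
        optimizer.step()
        accuracy = (logits.argmax(dim=1) == labels).float().mean().item()
        if accuracy >= threshold:            # stop iterating once the specified threshold is reached
            break
    return model
```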
  • step S102 a voice signal to be recognized is acquired.
  • the voice signal to be recognized is a short text, that is, a short-length voice signal, such as a sentence-length voice signal, so as to reduce the requirements for data.
  • the acquired voice signal to be recognized should come from the user to be recognized.
  • the voice signal to be recognized may be one voice signal or multiple voice signals.
  • step S103 preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the MFCC feature.
  • In step S103, before using the deep neural network, feature extraction is first performed on the speech signal to be recognized to obtain the corresponding MFCC feature.
  • the step S103 includes:
  • step S301 framing processing is performed on the waveform of the voice signal to be recognized.
  • Framing refers to cutting the indefinite-length waveform of the voice signal into small fixed-length segments, usually 10-30 milliseconds per frame. The speech signal changes rapidly, whereas the Fourier transform is suited to analyzing stationary signals; by framing the waveform of the speech signal, the side-lobe intensity after the Fourier transform can be reduced and the quality of the resulting spectrum improved.
  • step S302 after framing processing, windowing processing is performed on each frame signal.
  • each frame signal is windowed to smooth the speech signal.
  • a Hamming window can be used for smoothing. Compared with a rectangular window function, the Hamming window enhances the continuity of the left and right ends of the speech signal, and can effectively reduce the intensity of side lobes and spectrum leakage after Fourier transform.
  • step S303 the discrete Fourier transform is performed on each frame signal after the windowing process to obtain the frequency spectrum corresponding to the frame signal.
  • step S304 the power spectrum of the speech signal is calculated according to the spectrum corresponding to all frame signals.
  • the spectrum obtained from the discrete Fourier transform describes the energy distribution of the signal in the frequency domain;
  • the energy differs across frequency bands, and different phonemes have different energy spectra, so the modulus square of the frequency spectrum of the speech signal is taken to obtain the power spectrum of the speech signal.
  • step S305 the Mel filter bank is calculated according to the power spectrum.
  • the Mel filter bank is a set of nonlinearly distributed filter banks, which are densely distributed in the low-frequency part and sparsely distributed in the high-frequency part, which can better meet the human hearing characteristics.
  • a filter bank consisting of n triangular filters is applied to the voice signal, that is, the power spectrum of the voice signal is multiplied by the set of n triangular filters, so that the power spectrum of the voice signal is transformed into an n-dimensional vector.
  • the triangular filter can eliminate the effect of harmonics, highlight the formant of the original voice signal, and thereby reduce the amount of data.
  • step S306 logarithmic operation is performed on the output of each Mel filter to obtain logarithmic energy.
  • Each element of the n-dimensional vector obtained in step S305 is the output of one mel filter in the mel filter bank. The embodiment of the present application further takes the logarithm of each element of this n-dimensional vector to obtain the logarithmic energy output by the Mel filter bank, that is, the log-mel filter bank energies. The logarithmic energy is used for the subsequent cepstrum analysis.
  • step S307 the discrete cosine transform is performed on the logarithmic energy to obtain the MFCC feature of the speech signal.
  • After obtaining the logarithmic energy of the voice signal in step S306 above, the embodiment of the present application performs a discrete cosine transform on the logarithmic energy and takes the low-order 128-dimensional coefficients of the output as the MFCC feature of the voice signal.
  • the output of the discrete cosine transform has a good energy-compaction property: the larger values are concentrated in the low-order part near the upper-left corner, while the remaining part contains a large number of values that are zero or close to zero.
  • the embodiment of the present application takes the low-order 128-dimensional values of the output as the MFCC feature, so that the amount of data can be further compressed.
  • the MFCC feature does not depend on the nature of the signal and imposes no restrictions on the input signal; it is highly robust, conforms to the auditory characteristics of the human ear, and still offers good recognition performance when the signal-to-noise ratio is reduced.
  • the MFCC feature is used as the sound feature of the voice signal to be recognized, and is transmitted to the deep neural network for recognition, which can improve the accuracy of deep neural network recognition.
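  • The following is a minimal NumPy/SciPy sketch of the feature-extraction pipeline of steps S301-S307 (framing, Hamming windowing, DFT, power spectrum, triangular mel filter bank, log energy, DCT). The sample rate, frame length, hop size, FFT size and number of filters are assumptions; the patent only fixes the final 128-dimensional MFCC feature, and since it does not fully specify how the per-frame DCT coefficients are reduced to that single 128-dimensional vector, this sketch simply keeps the first 128 coefficients of the flattened output.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale: dense at low, sparse at high frequencies.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def extract_mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_coeffs=128):
    """Steps S301-S307 on a 1-D waveform; returns (up to) a 128-dimensional MFCC feature vector."""
    # S301: cut the waveform into fixed-length frames (here 25 ms frames with a 10 ms hop).
    frames = np.array([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    # S302: apply a Hamming window to smooth each frame.
    frames = frames * np.hamming(frame_len)
    # S303: discrete Fourier transform of every windowed frame.
    spectrum = np.fft.rfft(frames, n=n_fft)
    # S304: modulus squared of the spectrum gives the power spectrum.
    power = (np.abs(spectrum) ** 2) / n_fft
    # S305: multiply the power spectrum by the triangular mel filter bank (n-dimensional vector per frame).
    fbank_out = power @ mel_filterbank(n_filters, n_fft, sr).T
    # S306: logarithm of each filter output (log-mel filter bank energies).
    log_energy = np.log(fbank_out + 1e-10)
    # S307: discrete cosine transform, then keep the low-order coefficients as the MFCC feature.
    mfcc = dct(log_energy, type=2, axis=1, norm='ortho')
    return mfcc.flatten()[:n_coeffs]   # for very short signals this may hold fewer than 128 values
```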
  • step S104 the MFCC feature is input to a pre-trained deep neural network, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal.
  • Each element in the voiceprint vector represents the characteristics of the voice signal.
  • the MFCC feature of the voice signal is obtained, the MFCC feature is passed as an input to a pre-trained deep neural network, and the voice signal is recognized based on the MFCC feature through the deep neural network.
  • the pre-trained deep neural network includes four fully connected layers, each containing 12 nodes, and a 12-dimensional output vector is obtained through the maxout excitation function.
  • the output vector of the neural network in the last fully connected layer is obtained as the d-vector vector of the speech signal.
  • the d-vector vector is the voiceprint vector of the voice signal, and each element in it represents the voiceprint feature of the voice signal.
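  • For illustration, extracting the d-vector of one utterance with the sketches above might look like the following; model and extract_mfcc are the names from the earlier sketches and are assumptions, not names used by the patent.

```python
import torch

def extract_voiceprint(model, signal):
    """Return the d-vector of one short-text utterance: the output of the trained
    network's last fully connected layer (see the earlier sketches for the assumed
    model and extract_mfcc helpers)."""
    model.eval()                                   # disable dropout at recognition time
    with torch.no_grad():
        mfcc = torch.tensor(extract_mfcc(signal), dtype=torch.float32).unsqueeze(0)
        return model(mfcc, return_dvector=True).squeeze(0)   # 12-dimensional voiceprint vector
```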
  • step S105 the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library, and the voiceprint detection result is output according to the comparison result.
  • the voiceprint model library is set according to needs in combination with the application scenarios of identity authentication, such as online payment, voiceprint lock control, and survival authentication.
  • the user who needs to be authenticated is identified in advance through the deep neural network, and the voiceprint vector is extracted and entered into the voiceprint model library.
  • the voiceprint vector of the voice signal to be recognized is compared with the pre-stored voiceprint vector in the voiceprint model library to perform speaker discrimination of the voice signal.
  • the step S105 includes:
  • step S401 the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library.
  • the embodiment of the present application compares the voiceprint vector of the voice signal with each pre-stored voiceprint vector in the voiceprint model library to determine whether the elements in the two are the same.
  • step S402 if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, user information corresponding to the pre-stored voiceprint vector is obtained, and the user information is output.
  • If a matching pre-stored voiceprint vector exists, the speaker of the voice signal is an authenticated user in the voiceprint model library; the user information corresponding to the pre-stored voiceprint vector is obtained and output, thereby completing the recognition of the voice signal to be recognized.
  • step S403 if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, a prompt message indicating that the detection fails is output.
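  • A minimal sketch of the comparison of steps S401-S403, assuming the voiceprint model library is a dictionary mapping user information to the pre-stored d-vectors produced with the earlier sketches. The patent only asks whether the two vectors are "the same"; the element-wise tolerance used here is an assumption.

```python
import torch

def detect_voiceprint(d_vector, model_library, tolerance=1e-3):
    """S401-S403: compare the voiceprint vector with every pre-stored vector in the library."""
    for user_info, stored in model_library.items():
        if torch.allclose(d_vector, stored, atol=tolerance):
            return user_info          # S402: matching pre-stored vector found, output the user information
    return None                       # S403: no match, the detection fails
```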
  • the short text-based voiceprint detection method described in the embodiments of the present application can be applied to a series of application scenarios that need to be combined with identity authentication, such as online payment, voiceprint lock control, and survival authentication, and can also be used in IoT device verification.
  • Especially in remote verification where video image verification is inconvenient, the method is not restricted by equipment at all, and identity can be confirmed by telephone, which can greatly reduce the cost of remote verification.
  • The embodiment of the present application redesigns in advance a deep neural network suited to short text and trains the preset deep neural network with short-text training samples. When performing voiceprint detection, the voice signal to be recognized is obtained, the voice signal being a short text; the voice signal to be recognized is preprocessed, and feature extraction is performed on the preprocessed voice signal to obtain the MFCC feature; the MFCC feature is passed as input into the pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal; the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vectors in the voiceprint model library, and the voiceprint detection result is output according to the comparison result. Voiceprint detection based on short text is thus realized, the input vector of the model is greatly reduced, and the problems of lengthy voice signals, large amounts of sample information, and high computing resource requirements in existing voiceprint detection methods are solved.
  • a short text-based voiceprint detection device is provided, and the short text-based voiceprint detection device corresponds to the short text-based voiceprint detection method in the foregoing embodiment.
  • the short text-based voiceprint detection device includes a training module, a signal acquisition module, a feature extraction module, a feature acquisition module, and a detection module.
  • the detailed description of each functional module is as follows:
  • the training module 51 is used to obtain training samples and to use the training samples to train a preset deep neural network;
  • the signal acquisition module 52 is used to acquire the voice signal to be recognized
  • the feature extraction module 53 is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
  • the feature acquisition module 54 is used to pass the Mel frequency cepstrum coefficients as input into a pre-trained deep neural network and to acquire the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the detection module 55 is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
  • the training samples and speech signals are both short texts.
  • the training module 51 includes:
  • the sample acquisition unit is used to acquire voice samples of multiple users as training samples
  • the feature extraction unit is configured to preprocess the training samples of each user, and perform feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
  • the tag unit is used to tag the Mel frequency cepstrum coefficient of each user with a user tag
  • the training unit is used to input the Mel frequency cepstrum coefficients with user tags as input vectors into the preset deep neural network for training;
  • the parameter modification unit is used to calculate, using a preset loss function, the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag, and to modify the parameters of the deep neural network according to the error;
  • the training unit is also used to pass the Mel frequency cepstrum coefficients with user labels as input vectors into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each Mel frequency cepstrum coefficient reaches the specified threshold, at which point iteration stops.
  • the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a dropout strategy.
  • the feature extraction module 53 includes:
  • the framing unit is configured to perform framing processing on the waveform diagram of the voice signal to be recognized
  • the windowing unit is used to perform windowing processing on each frame of signal after framing processing
  • a transforming unit for performing discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal
  • a power spectrum calculation unit configured to calculate the power spectrum of the voice signal according to the spectrum corresponding to all frame signals
  • a filter bank calculation unit for calculating a mel filter bank according to the power spectrum
  • Logarithmic unit used to perform logarithmic operation on the output of each mel filter to obtain logarithmic energy
  • the cosine transform unit is configured to perform discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the voice signal.
  • the detection module 55 includes:
  • the comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library
  • the first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
  • the second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
  • each module in the aforementioned short text-based voiceprint detection device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in hardware in, or be independent of, the processor of the computer device, or may be stored in software in the memory of the computer device, so that the processor can call them and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 6.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a short text-based voiceprint detection method.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • the training samples and speech signals are both short texts.
  • one or more non-volatile readable storage media storing computer readable instructions are provided.
  • When the computer readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the training samples and speech signals are both short texts.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), memory-bus dynamic RAM (RDRAM), and so on.

Abstract

Disclosed are a voiceprint detection method, apparatus and device based on a short text, and a storage medium. The method comprises: training a preset deep neural network by means of a training sample; acquiring a speech signal to be recognized; preprocessing the speech signal to be recognized, and carrying out feature extraction on the preprocessed speech signal to obtain a mel-frequency cepstral coefficient; taking the mel-frequency cepstral coefficient as an input and transmitting same into the pretrained deep neural network, and acquiring an output vector of the deep neural network on a last full connection layer and taking same as a voiceprint vector of the speech signal; and comparing the voiceprint vector of the speech signal with a voiceprint vector prestored in a voiceprint model library, and outputting a voiceprint detection result according to a comparison result, wherein the training sample and the speech signal are both a short text. The present application solves the problems of a redundant speech signal, a large quantity of sample information and a high requirement for computing resources in an existing voiceprint detection method.

Description

Voiceprint detection method, device, equipment and storage medium based on short text
This application is based on, and claims priority from, the Chinese invention patent application filed on March 6, 2019 with application number 201910167882.3, titled "Short text-based voiceprint detection method, device, equipment and storage medium".
Technical field
This application relates to the field of information technology, and in particular to a short text-based voiceprint detection method, device, equipment and storage medium.
Background
Voiceprint detection is a common and effective identity recognition method that can be applied to a series of scenarios requiring identity authentication, such as online payment, voiceprint lock control, survival authentication, and Internet of Things device verification. It is especially useful in remote verification where video image verification is inconvenient, since it is not restricted by the device at all. When verifying, using content and voiceprint detection for double verification can greatly raise the difficulty of attack and improve security. Currently, commonly used voiceprint detection methods include, but are not limited to, the template matching method, the probability model method, the artificial neural network method, and the I-vector model method. In these methods, however, the structure of the model itself makes it difficult to complete training with short text, so only long text carrying more features can usually be used as the model input vector. Yet the longer the speech signal, the more features it carries, the larger the amount of sample information required during training, and the more computing resources are occupied.
Summary of the invention
The embodiments of the present application provide a short text-based voiceprint detection method, device, equipment, and storage medium to solve the problems of lengthy voice signals, large amounts of sample information, and high computing resource requirements in existing voiceprint detection methods.
A voiceprint detection method based on short text, including:
obtaining training samples, and using the training samples to train a preset deep neural network;
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
passing the Mel frequency cepstrum coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
Optionally, obtaining training samples and using the training samples to train the preset deep neural network includes:
obtaining voice samples of multiple users as training samples;
preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain Mel frequency cepstrum coefficients;
tagging the Mel frequency cepstrum coefficients of each user with a user label;
passing the Mel frequency cepstrum coefficients with user labels as input vectors into the preset deep neural network for training;
using a preset loss function to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user label, and modifying the parameters of the deep neural network according to the error;
passing the Mel frequency cepstrum coefficients with user labels as input vectors into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each Mel frequency cepstrum coefficient reaches the specified threshold, at which point iteration stops.
Optionally, the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
Optionally, comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library and outputting the voiceprint detection result according to the comparison result includes:
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
if there is a pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, acquiring the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
if there is no pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, outputting a prompt message that the detection has failed.
Optionally, preprocessing the voice signal to be recognized and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficients includes:
performing framing processing on the waveform of the voice signal to be recognized;
after framing, performing windowing processing on each frame of the signal;
performing a discrete Fourier transform on each windowed frame to obtain the frequency spectrum corresponding to that frame;
calculating the power spectrum of the voice signal from the spectra of all frames;
calculating a mel filter bank from the power spectrum;
performing a logarithmic operation on the output of each mel filter to obtain the logarithmic energy;
performing a discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficients of the speech signal.
A voiceprint detection device based on short text, including:
a training module, used to obtain training samples and use the training samples to train a preset deep neural network;
a signal acquisition module, used to acquire the voice signal to be recognized;
a feature extraction module, used to preprocess the voice signal to be recognized and perform feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
a feature acquisition module, used to pass the Mel frequency cepstrum coefficients as input into a pre-trained deep neural network and acquire the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, where each element in the voiceprint vector represents a feature of the voice signal;
a detection module, used to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in a voiceprint model library and output a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
Optionally, the detection module includes:
a comparison unit, used to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
a first result output unit, used to acquire the user information corresponding to a pre-stored voiceprint vector and output the user information if there is a pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal;
a second result output unit, used to output a prompt message that the detection has failed if there is no pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal.
Optionally, the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining training samples, and using the training samples to train a preset deep neural network;
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
passing the Mel frequency cepstrum coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
One or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining training samples, and using the training samples to train a preset deep neural network;
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
passing the Mel frequency cepstrum coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
The details of one or more embodiments of the present application are presented in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
FIG. 1 is a flowchart of a voiceprint detection method based on short text in an embodiment of the present application;
FIG. 2 is a flowchart of step S101 in the voiceprint detection method based on short text in an embodiment of the present application;
FIG. 3 is a flowchart of step S103 in the voiceprint detection method based on short text in an embodiment of the present application;
FIG. 4 is a flowchart of step S105 in the short text-based voiceprint detection method in an embodiment of the present application;
FIG. 5 is a functional block diagram of a voiceprint detection device based on short text in an embodiment of the present application;
FIG. 6 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The voiceprint detection method based on short text provided by the embodiments of the present application is applied to a server. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In one embodiment, as shown in FIG. 1, a voiceprint detection method based on short text is provided, which includes the following steps:
In step S101, training samples are obtained, and the training samples are used to train a preset deep neural network.
Here, the embodiment of the present application redesigns a deep neural network suited to short text. The deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy. In this way, the deep neural network is not limited by the model structure and can use short texts as training samples and input vectors, thereby reducing the data requirements. Here, a short text is a voice signal of relatively short length, for example a sentence-length voice signal. Optionally, the short text may be defined by a specified length, i.e., a voice signal whose length is less than or equal to the specified length. The embodiment of the present application collects voice samples of multiple users as training samples and trains the preset deep neural network on these training samples. Optionally, as shown in FIG. 2, step S101 includes:
In step S201, voice samples of multiple users are obtained as training samples.
In this embodiment, for a given application scenario, voice samples corresponding to multiple users can be collected in advance in that scenario; for example, voice samples corresponding to each user can be collected through channels such as professional knowledge bases and network databases, and used as training samples.
In step S202, the training samples of each user are preprocessed, and feature extraction is performed on the preprocessed training samples to obtain MFCC features.
Here, the MFCC feature (Mel-scale Frequency Cepstral Coefficients, MFCC for short) is a discriminative component of the speech signal: a cepstral parameter extracted in the Mel-scale frequency domain. Because it takes into account the human ear's perception of different frequencies, it is especially suitable for speech recognition and speaker recognition. The embodiment of the present application designs the deep neural network around the MFCC feature and uses the MFCC feature as the input of the deep neural network. Before training the deep neural network, the user samples are first preprocessed and feature-extracted to obtain the corresponding MFCC features. The preprocessing and feature extraction of the user's training samples are the same as in step S103; for details, please refer to the description of step S103, which is not repeated here.
In the embodiment of the present application, feature extraction on the preprocessed training sample yields a set of 128-dimensional MFCC features corresponding to the training sample. The 128-dimensional MFCC feature is used as the input vector of the deep neural network.
In step S203, the MFCC features of each user are tagged with a user label.
In the embodiment of the present application, the user label identifies the speaker to whom the MFCC feature belongs. Different users' MFCC features carry different user labels. Before training the deep neural network, the 128-dimensional MFCC features of each user need to be tagged with the corresponding user label. The following example illustrates this. Assume there are three users: user 1, user 2, and user 3. In step S203, user 1's MFCC features are tagged with the user label "01", user 2's MFCC features with the user label "02", and user 3's MFCC features with the user label "03". It should be understood that this is only an example of the present application and is not intended to limit it; in other embodiments, the user label may take other forms.
In step S204, the MFCC features with user labels are passed as input vectors into the preset deep neural network for training.
During training, for each user, the 128-dimensional MFCC features carrying that user's label are used as an input vector and passed into the preset deep neural network for training, and the recognition result for that user is obtained.
在这里，所述预设的深度神经网络包括输入层、四层全连接层以及输出层。每一全连接层为12维输入，使用的是maxout激发函数，其隐含层节点的输出表达式为：Here, the preset deep neural network includes an input layer, four fully connected layers, and an output layer. Each fully connected layer has a 12-dimensional input and uses the maxout excitation function; the output expression of its hidden layer nodes is:
$$z_{ij} = x^{\mathrm{T}} W_{\cdot ij} + b_{ij}, \qquad j = 1,\dots,k$$

$$h_i(x) = \max_{j \in \{1,\dots,k\}} z_{ij}$$
在上式中，b表示偏置值，W表示由参数组成的三维矩阵，尺寸为d×m×k，d表示输入层的节点个数，m表示隐含层的节点个数，k表示每个隐含层节点对应的隐隐含层的节点个数，所述k个隐隐含层的节点都是线性输出的。maxout激发函数的每个节点均为取所述k个隐隐含层节点输出值中的最大值。In the above formulas, b denotes the bias values and W denotes a three-dimensional matrix of parameters with size d×m×k, where d is the number of input-layer nodes, m is the number of hidden-layer nodes, and k is the number of sub-hidden-layer nodes corresponding to each hidden-layer node; the k sub-hidden-layer nodes all have linear outputs. Each node of the maxout excitation function takes the maximum of the output values of its k sub-hidden-layer nodes.
在本申请实施例中，每一全连接层的节点个数m为12，12个节点中的每一个节点，取maxout激发函数生成的k个隐隐含层节点输出值中的最大值，组合该12个节点对应的最大值，作为该全连接层的输出向量。本申请实施例通过使用maxout激发函数，使得深度神经网络的全连接层为非线性转换。In this embodiment of the present application, the number of nodes m in each fully connected layer is 12. Each of the 12 nodes takes the maximum of the output values of its k sub-hidden-layer nodes generated by the maxout excitation function, and the maxima of the 12 nodes are combined as the output vector of that fully connected layer. By using the maxout excitation function, the embodiment of the present application makes the fully connected layers of the deep neural network perform a nonlinear transformation.
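The following illustrative sketch (not part of the original application) shows one way the maxout computation described above can be written in Python with NumPy; the variable names x, W and b and the value k = 3 are assumptions of this sketch, while the d×m×k shape of W follows the description.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation: for each of the m hidden nodes, take the maximum of the
    k linear sub-unit outputs z_ij = x·W[:, i, j] + b[i, j]."""
    # x: (d,) input vector; W: (d, m, k) parameter tensor; b: (m, k) bias values
    z = np.einsum('d,dmk->mk', x, W) + b   # linear outputs of the k sub-units per node
    return z.max(axis=1)                   # (m,) output: max over the k sub-units

# Example with the dimensions used in the embodiment: 12-node layer, k sub-units per node
d, m, k = 12, 12, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W = rng.standard_normal((d, m, k)) * 0.1
b = np.zeros((m, k))
print(maxout(x, W, b).shape)  # (12,)
```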
进一步地，在本申请实施例中，所述深度神经网络包括四层全连接层，分别记为第一全连接层、第二全连接层、第三全连接层、第四全连接层。在进行训练时，首先将所述带有用户标签的MFCC特征经过第一全连接层，然后将第一全连接层的输出向量作为第二全连接层的输入向量，将第二全连接层的输出向量作为第三全连接层的输入向量，将第三全连接层的输出向量作为第四全连接层的输入向量，将第四全连接层的输出向量作为输出层的输入向量。在第三全连接层和第四全连接层进行训练时，本申请实施例采用丢弃策略，即dropout策略。第二全连接层的输出向量传入第三全连接层时，按照预设第一丢弃概率随机丢弃第三全连接层的输出向量中的元素。应当理解，丢弃是指把这些元素从网络中"抹去"，相当于在本次训练中，这些被"抹去"的元素不参与本次训练。然后使用第三全连接层的maxout激发函数对剩余的元素进行训练，生成第三全连接层的输出向量。再按照预设第二丢弃概率随机丢弃第三全连接层得到的输出向量中的元素，将剩余的元素输入第四全连接层进行训练。在这里，所述第一丢弃概率和第二丢弃概率根据实际需求设定，本申请实施例优选为0.5。通过使用dropout策略，有效地削弱了隐含层节点间的联合适应性，增强了泛化能力，从而防止了深度神经网络在训练过程中过拟合，有利于提升深度神经网络的训练效果。Further, in this embodiment of the present application, the deep neural network includes four fully connected layers, denoted the first, second, third, and fourth fully connected layers. During training, the MFCC features with user labels first pass through the first fully connected layer; the output vector of the first fully connected layer is then used as the input vector of the second fully connected layer, the output vector of the second fully connected layer as the input vector of the third fully connected layer, the output vector of the third fully connected layer as the input vector of the fourth fully connected layer, and the output vector of the fourth fully connected layer as the input vector of the output layer. When training the third and fourth fully connected layers, this embodiment of the present application adopts a discard strategy, namely the dropout strategy. When the output vector of the second fully connected layer is passed into the third fully connected layer, elements of the incoming vector are randomly discarded according to a preset first discard probability. It should be understood that discarding means "erasing" these elements from the network, which is equivalent to the "erased" elements not taking part in the current training pass. The maxout excitation function of the third fully connected layer is then trained on the remaining elements to generate the output vector of the third fully connected layer. Elements of the output vector obtained by the third fully connected layer are then randomly discarded according to a preset second discard probability, and the remaining elements are input into the fourth fully connected layer for training. Here, the first discard probability and the second discard probability are set according to actual requirements, and in this embodiment of the present application both are preferably 0.5. Using the dropout strategy effectively weakens the co-adaptation between hidden-layer nodes and enhances the generalization ability, thereby preventing the deep neural network from overfitting during training and helping to improve its training effect.
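As a minimal illustration of the discard (dropout) step described above, the sketch below randomly erases elements of a 12-dimensional layer output with the preferred probability of 0.5; the simple binary mask without rescaling and all variable names are assumptions of this sketch rather than details of the application.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                             # the preferred discard probability described above
h = rng.standard_normal(12)         # output vector of a 12-node fully connected layer
mask = rng.random(12) >= p          # keep each element with probability 1 - p
h_dropped = h * mask                # "erased" elements do not take part in this training pass
print(mask.astype(int))
print(h_dropped)
```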
在步骤S205中,采用预设的损失函数计算每一所述MFCC特征经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数。In step S205, a preset loss function is used to calculate the error between the recognition result of each MFCC feature through the deep neural network and the corresponding user tag, and the parameters of the deep neural network are modified according to the error .
所述深度神经网络经过四层全连接层后，第四全连接层的输出向量作为输出层的输入。输出层为softmax层，softmax层能够根据第四全连接层的输出向量进行分类，得到MFCC特征的识别结果。所述识别结果为所述深度神经网络预测所述MFCC特征所属的用户。如前所述，每一全连接层采用maxout激发函数，maxout激发函数包括一个三维的参数矩阵W和偏置值b。在通过步骤S204完成对每一所述MFCC特征的训练得到所述MFCC特征对应的识别结果后，采用预设的损失函数计算每一所述MFCC特征的识别结果与对应的用户标签之间的误差，并基于所述误差返回去修改所述深度神经网络中maxout激发函数的参数矩阵W和偏置值b。可选地，所述损失函数包括但不限于互熵损失函数、平方损失函数。After the deep neural network passes through the four fully connected layers, the output vector of the fourth fully connected layer serves as the input of the output layer. The output layer is a softmax layer, which classifies the output vector of the fourth fully connected layer to obtain the recognition result of the MFCC feature; the recognition result is the user to which the deep neural network predicts the MFCC feature belongs. As described above, each fully connected layer uses the maxout excitation function, which includes a three-dimensional parameter matrix W and bias values b. After the training of each MFCC feature is completed in step S204 and its recognition result is obtained, a preset loss function is used to calculate the error between the recognition result of each MFCC feature and the corresponding user label, and the error is propagated back to modify the parameter matrix W and bias values b of the maxout excitation functions in the deep neural network. Optionally, the loss function includes but is not limited to the cross-entropy loss function and the squared loss function.
在步骤S206中,将带有用户标签的MFCC特征作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一MFCC特征的识别结果的准确率达到指定阈值,停止迭代。In step S206, the MFCC feature with the user tag is used as an input vector to pass into the modified deep neural network for the next iteration training, until the accuracy of the recognition result of each MFCC feature by the deep neural network reaches the specified Threshold, stop iteration.
通过步骤S205修改参数后的深度神经网络，用于进行下一次训练，即将带有用户标签的MFCC特征作为输入向量再次传入参数修改后的深度神经网络进行训练，训练过程和步骤S204的相同，具体参见上面的叙述，此处不再赘述。重复迭代步骤S204、S205、S206，直至所述深度神经网络对所有用户的MFCC特征的识别结果的准确率达到指定阈值，即所述深度神经网络对每一所述MFCC特征的识别结果与对应的用户标签相同的概率达到所述指定阈值，则说明所述深度神经网络中的各个参数已经调整到位，确定所述深度神经网络已训练完成，停止迭代。The deep neural network whose parameters have been modified in step S205 is used for the next round of training: the MFCC features with user labels are again passed as input vectors into the parameter-modified deep neural network. The training process is the same as in step S204 and is not repeated here. Steps S204, S205, and S206 are iterated until the accuracy of the deep neural network's recognition results for the MFCC features of all users reaches a specified threshold, that is, until the probability that the recognition result of each MFCC feature matches the corresponding user label reaches the specified threshold. This indicates that the parameters of the deep neural network have been adjusted in place; the deep neural network is then determined to be fully trained, and the iteration stops.
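The following sketch, written with PyTorch as an assumed implementation framework, illustrates how the described four-layer maxout network with dropout before the third and fourth layers, the softmax/cross-entropy output of step S205, and the accuracy-threshold stopping rule of step S206 could fit together. The class and function names, the value k = 3, the learning rate, and the random stand-in data are all assumptions; mapping the 128-dimensional MFCC input down to 12-node layers follows the feature size and layer size given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxoutLayer(nn.Module):
    """One fully connected maxout layer: k linear sub-units per node; the node output is their maximum."""
    def __init__(self, in_dim, out_dim, k=3):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x).view(-1, self.out_dim, self.k)
        return z.max(dim=2).values

class ShortTextDNN(nn.Module):
    """Sketch of the described network: four 12-node maxout layers, dropout before the
    third and fourth layers, and a softmax output over the enrolled users."""
    def __init__(self, num_users, in_dim=128, hidden=12, k=3, p=0.5):
        super().__init__()
        self.fc1 = MaxoutLayer(in_dim, hidden, k)
        self.fc2 = MaxoutLayer(hidden, hidden, k)
        self.fc3 = MaxoutLayer(hidden, hidden, k)
        self.fc4 = MaxoutLayer(hidden, hidden, k)
        self.drop = nn.Dropout(p)
        self.out = nn.Linear(hidden, num_users)

    def forward(self, x):
        h = self.fc2(self.fc1(x))
        h = self.fc3(self.drop(h))       # first discard step before the third layer
        h = self.fc4(self.drop(h))       # second discard step before the fourth layer
        return self.out(h)               # logits; softmax is applied inside the loss below

def train_until_threshold(model, features, labels, threshold=0.95, max_epochs=200, lr=1e-2):
    """Iterate training until the label accuracy reaches the specified threshold (step S206)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(features), labels)   # error against the user labels (step S205)
        loss.backward()
        optimizer.step()                                   # modify the network parameters from the error
        model.eval()
        with torch.no_grad():
            accuracy = (model(features).argmax(dim=1) == labels).float().mean().item()
        if accuracy >= threshold:
            break
    return model

# Toy run with random stand-in MFCC features for three users (10 feature vectors each)
torch.manual_seed(0)
features = torch.randn(30, 128)
labels = torch.arange(3).repeat_interleave(10)
model = train_until_threshold(ShortTextDNN(num_users=3), features, labels)
```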
训练好的深度神经网络可用于对语音信号提取声纹向量。The trained deep neural network can be used to extract the voiceprint vector from the speech signal.
在步骤S102中,获取待识别的语音信号。In step S102, a voice signal to be recognized is acquired.
所述待识别的语音信号为短文本,即长度较短的语音信号,比如一个句子长度的语音信号,以降低对数据的要求。在每一次识别过程中,所获取的待识别的语音信号应当为一个待识别用户的。所述待识别的语音信号可以是一条语音信号或者多条语音信号。The voice signal to be recognized is a short text, that is, a short-length voice signal, such as a sentence-length voice signal, so as to reduce the requirements for data. In each recognition process, the acquired voice signal to be recognized should be of a user to be recognized. The voice signal to be recognized may be one voice signal or multiple voice signals.
在步骤S103中,对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到MFCC特征。In step S103, preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the MFCC feature.
在使用深度神经网络之前,首先对待识别的语音信号进行特征提取,得到对应的MFCC特征。可选地,如图3所示,所述步骤S103包括:Before using the deep neural network, first perform feature extraction on the speech signal to be recognized to obtain the corresponding MFCC feature. Optionally, as shown in FIG. 3, the step S103 includes:
在步骤S301中,对所述待识别的语音信号的波形图执行分帧处理。In step S301, framing processing is performed on the waveform of the voice signal to be recognized.
在这里，分帧处理是指将不定长度的语音信号的波形图切分成长度固定的小段，通常取10-30毫秒为一帧。由于语音信号是快速变化的，而傅里叶变换适用于分析平稳的信号。通过对语音信号的波形图进行分帧，可以降低傅里叶变换后旁瓣的强度，提高获取的频谱质量。Here, framing refers to cutting the waveform of the voice signal, which has an indefinite length, into short segments of fixed length, usually 10-30 milliseconds per frame. The speech signal changes rapidly, whereas the Fourier transform is suited to analyzing stationary signals; by framing the waveform of the speech signal, the intensity of the side lobes after the Fourier transform can be reduced and the quality of the obtained spectrum improved.
在步骤S302中,在分帧处理之后,对每一帧信号执行加窗处理。In step S302, after framing processing, windowing processing is performed on each frame signal.
本申请实施例通过对每一帧信号进行加窗处理,以平滑该语音信号。可选地,可以使用汉明窗加以平滑,相比于矩形窗函数,汉明窗加强了语音信号左端和右端的连续性,可以有效地减弱傅里叶变换后旁瓣的强度以及频谱泄露。In the embodiment of the present application, each frame signal is windowed to smooth the speech signal. Optionally, a Hamming window can be used for smoothing. Compared with a rectangular window function, the Hamming window enhances the continuity of the left and right ends of the speech signal, and can effectively reduce the intensity of side lobes and spectrum leakage after Fourier transform.
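Steps S301 and S302 can be sketched as follows (illustrative only, not taken from the application); the 25 ms frame length, the 10 ms hop, and the 16 kHz sampling rate are assumptions chosen within the 10-30 ms range mentioned above.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a waveform into fixed-length frames and smooth each frame with a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len:i * hop_len + frame_len] for i in range(num_frames)])
    return frames * np.hamming(frame_len)   # windowing reduces spectral leakage at the frame edges

# Stand-in for a short, sentence-length utterance sampled at 16 kHz
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
frames = frame_and_window(speech, sample_rate=16000)
print(frames.shape)   # (number of frames, samples per frame)
```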
在步骤S303中,对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱。In step S303, the discrete Fourier transform is performed on each frame signal after the windowing process to obtain the frequency spectrum corresponding to the frame signal.
由于语音信号在时域上的变化很难看出语音信号的特性,因此需要将语音信号转换成频域上的能量分布来观察。不同的能量分布表示不同语音的特性。在对每一帧语音信号进行加窗处理后,再进行离散傅里叶变换,得到该帧信号在频谱上的能量分布。对分帧加窗后的各帧信号进行离散傅里叶变换得到各帧的频谱,进而得到语音信号的频谱。Since it is difficult to see the characteristics of the voice signal when the voice signal changes in the time domain, it is necessary to convert the voice signal into an energy distribution in the frequency domain for observation. Different energy distributions represent the characteristics of different voices. After windowing is performed on each frame of speech signal, discrete Fourier transform is performed to obtain the energy distribution of the frame signal on the frequency spectrum. Discrete Fourier transform is performed on each frame signal after frame division and windowing to obtain the frequency spectrum of each frame, and then the frequency spectrum of the speech signal.
在步骤S304中,根据所有帧信号对应的频谱计算所述语音信号的功率谱。In step S304, the power spectrum of the speech signal is calculated according to the spectrum corresponding to all frame signals.
在完成离散傅里叶变换后，得到的能量分布是频域信号。每一个频带范围的能量大小不一，不同音素的能量谱也不一样，需要对所述语音信号的频谱取模平方得到所述语音信号的功率谱。After completing the discrete Fourier transform, the energy distribution obtained is a frequency domain signal. The energy of each frequency band is different, and the energy spectrum of different phonemes is also different. It is necessary to take the modulus square of the frequency spectrum of the speech signal to obtain the power spectrum of the speech signal.
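Steps S303 and S304 can be sketched as below; the FFT size of 512 and the normalization by the FFT length are assumptions of this illustration rather than parameters stated in the application.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Per-frame discrete Fourier transform followed by the squared magnitude (power)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # frequency-domain representation of each frame
    return (np.abs(spectrum) ** 2) / n_fft            # power spectrum of each frame

rng = np.random.default_rng(0)
frames = rng.standard_normal((98, 400))               # stand-in for windowed frames
power = power_spectrum(frames)
print(power.shape)                                    # (98, 257): n_fft / 2 + 1 bins per frame
```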
在步骤S305中,根据所述功率谱计算梅尔滤波器组。In step S305, the Mel filter bank is calculated according to the power spectrum.
在这里，梅尔滤波器组是一组非线性分布的滤波器组，其在低频部分分布密集，在高频部分分布稀疏，可以更好地满足人耳听觉特性。本申请实施例将一组包括n个三角滤波器的滤波器组作用到所述语音信号，即将所述语音信号的功率谱乘以一组n个三角滤波器，以将所述语音信号的功率谱转化为n维向量。在这里，所述三角滤波器能够消除谐波的作用，突显原有语音信号的共振峰，进而降低数据量。Here, the mel filter bank is a set of non-linearly distributed filters, dense in the low-frequency part and sparse in the high-frequency part, which better matches the hearing characteristics of the human ear. In this embodiment of the present application, a filter bank of n triangular filters is applied to the voice signal, that is, the power spectrum of the voice signal is multiplied by the set of n triangular filters so as to convert the power spectrum of the voice signal into an n-dimensional vector. Here, the triangular filters can eliminate the effect of harmonics and highlight the formants of the original voice signal, thereby reducing the amount of data.
在步骤S306中,对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量。In step S306, logarithmic operation is performed on the output of each Mel filter to obtain logarithmic energy.
通过步骤S305得到的n维向量中的每一个元素为梅尔滤波器组中的一个梅尔滤波器的输出，本申请实施例进一步对所得到的n维向量中的每一个元素进行取对数运算，得到所述梅尔滤波器组输出的对数能量，即log-mel filter bank energies。所述对数能量应用于后续进行倒谱分析。Each element of the n-dimensional vector obtained in step S305 is the output of one mel filter in the mel filter bank. In this embodiment of the present application, a logarithm is further taken of each element of the obtained n-dimensional vector to obtain the logarithmic energies output by the mel filter bank, that is, the log-mel filter bank energies. The logarithmic energies are used in the subsequent cepstrum analysis.
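The sketch below illustrates steps S305 and S306: building a bank of n triangular filters spaced on the mel scale, applying it to the power spectrum, and taking the logarithm. Using n = 128 filters keeps the dimensionality consistent with the 128-dimensional feature described later, but this count, the FFT size, and the small constant added before the logarithm are assumptions of the sketch; practical systems often use fewer filters and a larger FFT size.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build n triangular filters spaced evenly on the mel scale (dense at low frequencies)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):
            fbank[i - 1, b] = (b - left) / max(center - left, 1)   # rising edge of the triangle
        for b in range(center, right):
            fbank[i - 1, b] = (right - b) / max(right - center, 1) # falling edge of the triangle
    return fbank

rng = np.random.default_rng(0)
power = rng.random((98, 257))                       # stand-in power spectrum (n_fft = 512)
fbank = mel_filterbank(n_filters=128, n_fft=512, sample_rate=16000)
log_energies = np.log(power @ fbank.T + 1e-10)      # log-mel filter bank energies per frame
print(log_energies.shape)                           # (98, 128)
```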
在步骤S307中,对所述对数能量执行离散余弦变换,得到所述语音信号的MFCC特征。In step S307, the discrete cosine transform is performed on the logarithmic energy to obtain the MFCC feature of the speech signal.
在通过上述步骤S306得到所述语音信号的对数能量后，本申请实施例对所述对数能量进行离散余弦变换，并取输出结果中的低128维的系数，作为所述语音信号的MFCC特征。在这里，通过离散余弦变换得到的输出结果具有很好的能量聚集效应，较大的值集中在靠近左上角的低能量部分，其余部分产生大量的0或者接近0的数。本申请实施例取输出结果中低128维的值，作为MFCC特征，从而可以进一步压缩数据量。After the logarithmic energies of the voice signal are obtained through step S306 above, this embodiment of the present application performs a discrete cosine transform on the logarithmic energies and takes the low 128-dimensional coefficients of the output as the MFCC feature of the voice signal. Here, the output of the discrete cosine transform has a good energy-compaction effect: the larger values are concentrated in the low-energy part near the upper-left corner, while the remaining part produces a large number of values that are zero or close to zero. This embodiment of the present application takes the low 128-dimensional values of the output as the MFCC feature, so that the amount of data can be further compressed.
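Step S307 can then be sketched as a discrete cosine transform of the log energies, keeping the lowest 128 coefficients. Averaging the per-frame coefficients into a single 128-dimensional vector for the utterance is an assumption of this sketch, since the application does not specify how the per-frame features are pooled.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
log_energies = rng.random((98, 128))                              # stand-in log-mel filter bank energies
mfcc = dct(log_energies, type=2, axis=1, norm='ortho')[:, :128]   # keep the lowest 128 coefficients
feature = mfcc.mean(axis=0)                                       # one 128-dimensional vector per utterance
print(feature.shape)                                              # (128,)
```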
其中,MFCC特征不依赖于信号的性质,对输入信号不做任何的限制,具有较高的鲁棒性,符合人耳的听觉系数,当信噪比降低时仍然具有较好的识别性能,以所述MFCC特征作为所述待识别的语音信号的声音特征,传入深度神经网络中进行识别,可以提高深度神经网络识别的准确度。Among them, the MFCC feature does not depend on the nature of the signal and does not impose any restrictions on the input signal. It has high robustness and conforms to the hearing coefficient of the human ear. It still has good recognition performance when the signal-to-noise ratio is reduced. The MFCC feature is used as the sound feature of the voice signal to be recognized, and is transmitted to the deep neural network for recognition, which can improve the accuracy of deep neural network recognition.
在步骤S104中,将所述MFCC特征作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征。In step S104, the MFCC feature is input to a pre-trained deep neural network, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal. Each element in the voiceprint vector represents the characteristics of the voice signal.
在得到所述语音信号的MFCC特征之后,将所述MFCC特征作为输入传入至预先训练好的深度神经网络,通过所述深度神经网络基于所述MFCC特征对所述语音信号进行识别。在这里,所述预先训练好的深度神经网络中的包括四层全连接层,每一层全连接层包括12个节点,通过激发函数maxout函数得到一个12维的输出向量。当所述深度神经网络完成对所述语音信号的识别后,获取所述神经网络在最后一层全连接层的输出向量,作为所述语音信号的d-vector向量。所述d-vector向量为所述语音信号的声纹向量,其中的每个元素表示所述语音信号的声纹特征。After the MFCC feature of the voice signal is obtained, the MFCC feature is passed as an input to a pre-trained deep neural network, and the voice signal is recognized based on the MFCC feature through the deep neural network. Here, the pre-trained deep neural network includes four fully connected layers, each fully connected layer includes 12 nodes, and a 12-dimensional output vector is obtained through the excitation function maxout function. After the deep neural network completes the recognition of the speech signal, the output vector of the neural network in the last fully connected layer is obtained as the d-vector vector of the speech signal. The d-vector vector is the voiceprint vector of the voice signal, and each element in it represents the voiceprint feature of the voice signal.
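Assuming the ShortTextDNN module from the training sketch after step S206, the d-vector extraction described here can be sketched as running the trained network only up to the fourth fully connected layer; the function name and the reuse of that earlier, assumed module are illustrative choices, not details of the application.

```python
import torch

def extract_d_vector(model, mfcc_feature):
    """Run the trained network up to the fourth fully connected layer and return its
    output as the voiceprint (d-vector) of the utterance, skipping the softmax output layer."""
    model.eval()
    with torch.no_grad():
        x = mfcc_feature.unsqueeze(0)        # (1, 128) input vector
        h = model.fc2(model.fc1(x))
        h = model.fc3(model.drop(h))         # dropout is inactive in eval mode
        h = model.fc4(model.drop(h))
        return h.squeeze(0)                  # 12-dimensional d-vector from the last fully connected layer
```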
在步骤S105中,将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果。In step S105, the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library, and the voiceprint detection result is output according to the comparison result.
在这里,所述声纹模型库根据需要结合身份认证的应用场景进行设置,比如网络支付、声纹锁控、生存认证等。所述声纹模型库中有多个预存声纹向量及其对应的用户信息。在具体的应用场景中,预先通过所述深度神经网络对需要进行认证的用户进行识别,提取声纹向量,并录入至所述声纹模型库中。Here, the voiceprint model library is set according to needs in combination with the application scenarios of identity authentication, such as online payment, voiceprint lock control, and survival authentication. There are multiple pre-stored voiceprint vectors and their corresponding user information in the voiceprint model library. In a specific application scenario, the user who needs to be authenticated is identified in advance through the deep neural network, and the voiceprint vector is extracted and entered into the voiceprint model library.
在进行声纹检测时,将所述待识别的语音信号的声纹向量与所述声纹模型库中的预存声纹向量进行比对,以执行对所述语音信号的语者辨别。可选地,如图4所示,所述步骤S105包括:When performing voiceprint detection, the voiceprint vector of the voice signal to be recognized is compared with the pre-stored voiceprint vector in the voiceprint model library to perform speaker discrimination of the voice signal. Optionally, as shown in FIG. 4, the step S105 includes:
在步骤S401中,将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对。In step S401, the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library.
在这里,本申请实施例将所述语音信号的声纹向量与声纹模型库中的每一预存声纹向量进行比对,判断两者中的元素是否相同。Here, the embodiment of the present application compares the voiceprint vector of the voice signal with each pre-stored voiceprint vector in the voiceprint model library to determine whether the elements in the two are the same.
在步骤S402中,若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向量对应的用户信息,输出所述用户信息。In step S402, if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, user information corresponding to the pre-stored voiceprint vector is obtained, and the user information is output.
若声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时，表明所述待识别的语音信号的说话人已录入到声纹模型库中，所述语音信号属于所述声纹模型库中已认证的用户，获取所述预存声纹向量对应的用户信息，输出所述用户信息，从而完成对所述待识别的语音信号的识别。If there is a pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, it indicates that the speaker of the voice signal to be recognized has already been enrolled in the voiceprint model library and that the voice signal belongs to an authenticated user in the voiceprint model library. The user information corresponding to the pre-stored voiceprint vector is obtained and output, thereby completing the recognition of the voice signal to be recognized.
在步骤S403中,若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。In step S403, if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, a prompt message indicating that the detection fails is output.
若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时，表明所述待识别的语音信号的说话者未录入到声纹模型库中，所述语音信号不属于所述声纹模型库中已认证的用户，则输出校验失败的提示信息。If there is no pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, it indicates that the speaker of the voice signal to be recognized has not been enrolled in the voiceprint model library and that the voice signal does not belong to an authenticated user in the voiceprint model library; a prompt message indicating that the verification has failed is then output.
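The comparison of steps S401 to S403 can be sketched as follows; matching by element-wise equality (within a small numerical tolerance) follows the description above, while the dictionary structure of the model library and the tolerance value are assumptions of the sketch. Deployed systems commonly score similarity (for example cosine similarity) against a threshold rather than requiring equal vectors.

```python
import numpy as np

def detect(voiceprint, model_library, atol=1e-6):
    """Compare the extracted voiceprint vector against every pre-stored vector in the library.
    Returns the matching user's information, or None if verification fails."""
    for user_info, stored_vector in model_library.items():
        if np.allclose(voiceprint, stored_vector, atol=atol):   # element-wise comparison
            return user_info
    return None

# Toy library: user information mapped to previously enrolled d-vectors
rng = np.random.default_rng(0)
library = {"user_01": rng.standard_normal(12), "user_02": rng.standard_normal(12)}
probe = library["user_02"].copy()
print(detect(probe, library) or "detection failed")
```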
本申请实施例所述的基于短文本的声纹检测方法可应用于网络支付、声纹锁控、生存认证等一系列需要结合身份认证的应用场景,也可用于在物联网设备验证中。尤其在采用视频图像验证不方便的远程验证中,完全不受设备的限制,通过电话即可确认身份,可以极大地减小远程验证的成本。The short text-based voiceprint detection method described in the embodiments of the present application can be applied to a series of application scenarios that need to be combined with identity authentication, such as online payment, voiceprint lock control, and survival authentication, and can also be used in IoT device verification. Especially in remote verification where video image verification is inconvenient, it is not restricted by equipment at all, and the identity can be confirmed by telephone, which can greatly reduce the cost of remote verification.
综上所述，本申请实施例通过预先重新设计适用于短文本的深度神经网络，然后采用短文本的训练样本对预设的深度神经网络进行训练；在进行声纹检测时，获取待识别的语音信号，所述语音信号为短文本；对所述待识别的语音信号进行预处理，并对预处理后的所述语音信号进行特征提取，得到MFCC特征；将所述MFCC特征作为输入传入预先训练好的深度神经网络，获取所述深度神经网络在最后一层全连接层的输出向量，作为所述语音信号的声纹向量，所述声纹向量中的各个元素表示所述语音信号的特征；将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对，并根据比对结果输出声纹检测结果；从而实现了基于短文本的声纹检测，大大地缩小了模型的输入向量，解决了现有声纹检测方法中语音信号冗长、样本信息量大、运算资源要求高的问题。In summary, the embodiment of the present application redesigns in advance a deep neural network suitable for short text and then trains the preset deep neural network with short-text training samples. When performing voiceprint detection, a voice signal to be recognized is acquired, the voice signal being a short text; the voice signal to be recognized is preprocessed and feature extraction is performed on the preprocessed voice signal to obtain MFCC features; the MFCC features are passed as input into the pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal; the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vectors in the voiceprint model library, and the voiceprint detection result is output according to the comparison result. Voiceprint detection based on short text is thus realized, the input vector of the model is greatly reduced, and the problems of lengthy voice signals, large amounts of sample information, and high computing-resource requirements in existing voiceprint detection methods are solved.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.
在一实施例中,提供一种基于短文本的声纹检测装置,该基于短文本的声纹检测装置与上述实施例中基于短文本的声纹检测方法一一对应。如图5所示,该基于短文本的声纹检测装置包括训练模块、信息获取模块、特征提取模块、特征获取模块、检测模块。各功能模块详细说明如下:In one embodiment, a short text-based voiceprint detection device is provided, and the short text-based voiceprint detection device corresponds to the short text-based voiceprint detection method in the foregoing embodiment. As shown in FIG. 5, the short text-based voiceprint detection device includes a training module, an information acquisition module, a feature extraction module, a feature acquisition module, and a detection module. The detailed description of each functional module is as follows:
训练模块51，用于获取训练样本，采用所述训练样本对预设的深度神经网络进行训练；The training module 51 is configured to obtain training samples and to use the training samples to train a preset deep neural network;
信号获取模块52,用于获取待识别的语音信号;The signal acquisition module 52 is used to acquire the voice signal to be recognized;
特征提取模块53,用于对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;The feature extraction module 53 is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
特征获取模块54,用于将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The feature acquisition module 54 is used to input the Mel frequency cepstrum coefficients into a pre-trained deep neural network, and acquire the output vector of the deep neural network in the last fully connected layer as the speech signal The voiceprint vector of, where each element in the voiceprint vector represents the feature of the voice signal;
检测模块55,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;The detection module 55 is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
可选地,所述训练模块51包括:Optionally, the training module 51 includes:
样本获取单元,用于获取多个用户的语音样本作为训练样本;The sample acquisition unit is used to acquire voice samples of multiple users as training samples;
特征提取单元,用于对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;The feature extraction unit is configured to preprocess the training samples of each user, and perform feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
标签单元,用于对每一个所述用户的梅尔频率倒谱系数打上用户标签;The tag unit is used to tag the Mel frequency cepstrum coefficient of each user with a user tag;
训练单元,用于将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The training unit is used to input the Mel frequency cepstrum coefficients with user tags as input vectors into the preset deep neural network for training;
参数修改单元,用于采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;The parameter modification unit is used to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag using a preset loss function, and modify the Parameters of deep neural network;
所述训练单元还用于，将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练，直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值，停止迭代。The training unit is further configured to pass the mel-frequency cepstral coefficients with user labels as input vectors into the parameter-modified deep neural network for the next iteration of training, until the accuracy of the deep neural network's recognition result for each mel-frequency cepstral coefficient reaches a specified threshold, at which point the iteration stops.
可选地,所述深度神经网络包括输入层、四层全连接层以及输出层,每一全连接层为12维输入,采用maxout激发函数,且第三全连接层和第四全连接层采用dropout策略进行训练。Optionally, the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer, each fully connected layer is a 12-dimensional input, using a maxout excitation function, and the third fully connected layer and the fourth fully connected layer use Dropout strategy for training.
可选地,所述特征提取模块53包括:Optionally, the feature extraction module 53 includes:
分帧单元,用于对所述待识别的语音信号的波形图执行分帧处理;The framing unit is configured to perform framing processing on the waveform diagram of the voice signal to be recognized;
加窗单元,用于在分帧处理之后,对每一帧信号执行加窗处理;The windowing unit is used to perform windowing processing on each frame of signal after framing processing;
变换单元,用于对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱;A transforming unit for performing discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal;
功率谱计算单元,用于根据所有帧信号对应的频谱计算所述语音信号的功率谱;A power spectrum calculation unit, configured to calculate the power spectrum of the voice signal according to the spectrum corresponding to all frame signals;
滤波器组计算单元,用于根据所述功率谱计算梅尔滤波器组;A filter bank calculation unit for calculating a mel filter bank according to the power spectrum;
对数单元,用于对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量;Logarithmic unit, used to perform logarithmic operation on the output of each mel filter to obtain logarithmic energy;
余弦变换单元,用于对所述对数能量执行离散余弦变换,得到所述语音信号的梅尔频率倒谱系数。The cosine transform unit is configured to perform discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the voice signal.
可选地,所述检测模块55包括:Optionally, the detection module 55 includes:
比对单元,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对;The comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library;
第一结果输出单元,用于若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向量对应的用户信息,输出所述用户信息;The first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
第二结果输出单元,用于若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。The second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
关于基于短文本的声纹检测装置的具体限定可以参见上文中对于基于短文本的声纹检测方法的限定,在此不再赘述。上述基于短文本的声纹检测装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。For the specific limitation of the voiceprint detection device based on short text, please refer to the above limitation on the voiceprint detection method based on short text, which will not be repeated here. Each module in the aforementioned short text-based voiceprint detection device can be implemented in whole or in part by software, hardware, and a combination thereof. The foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图6所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种基于短文本的声纹检测方法。In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 6. The computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by the processor to realize a short text-based voiceprint detection method.
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,处理器执行计算机可读指令时实现以下步骤:In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer-readable instructions:
获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
获取待识别的语音信号;Obtain the voice signal to be recognized;
对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到MFCC特征;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain MFCC features;
将所述MFCC特征作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;Use the MFCC feature as input into a pre-trained deep neural network, and obtain the output vector of the deep neural network in the last fully connected layer as the voiceprint vector of the speech signal. In the voiceprint vector Each element of represents the characteristics of the voice signal;
将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
在一个实施例中,提供了一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:In one embodiment, one or more non-volatile readable storage media storing computer readable instructions are provided. When the computer readable instructions are executed by one or more processors, the one or more Each processor performs the following steps:
获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
获取待识别的语音信号;Obtain the voice signal to be recognized;
对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到MFCC特征;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain MFCC features;
将所述MFCC特征作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;Use the MFCC feature as input into a pre-trained deep neural network, and obtain the output vector of the deep neural network in the last fully connected layer as the voiceprint vector of the speech signal. In the voiceprint vector Each element of represents the characteristics of the voice signal;
将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检 测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a non-volatile computer. In a readable storage medium, when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that for the convenience and conciseness of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above-mentioned functions can be allocated to different functional units and modules as required. Module completion means dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of protection of this application.

Claims (20)

  1. 一种基于短文本的声纹检测方法,其特征在于,包括:A voiceprint detection method based on short text, which is characterized in that it includes:
    获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
    获取待识别的语音信号;Obtain the voice signal to be recognized;
    对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
    将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The Mel frequency cepstrum coefficient is passed into a pre-trained deep neural network as input, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal. Each element in the voiceprint vector represents the feature of the voice signal;
    将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
    其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
  2. 如权利要求1所述的基于短文本的声纹检测方法,其特征在于,所述获取训练样本,采用所述训练样本对预设的深度神经网络进行训练包括:The method for voiceprint detection based on short text according to claim 1, wherein said acquiring training samples and using said training samples to train a preset deep neural network comprises:
    获取多个用户的语音样本作为训练样本;Acquire voice samples of multiple users as training samples;
    对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;Preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
    对每一个所述用户的梅尔频率倒谱系数打上用户标签;Labeling a user tag on the Mel frequency cepstrum coefficient of each user;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The Mel frequency cepstrum coefficients with user labels are used as input vectors to the preset deep neural network for training;
    采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;Using a preset loss function to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag, and modify the parameters of the deep neural network according to the error;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值,停止迭代。The mel frequency cepstral coefficients with user labels are used as input vectors to pass into the modified deep neural network for the next iterative training, until the deep neural network has an accurate recognition result of each mel frequency cepstral coefficient If the rate reaches the specified threshold, stop iteration.
  3. 如权利要求2所述的基于短文本的声纹检测方法,其特征在于,所述深度神经网络包括输入层、四层全连接层以及输出层,每一全连接层为12维输入,采用maxout激发函数,且第三全连接层和第四全连接层采用丢弃策略进行训练。The voiceprint detection method based on short text according to claim 2, wherein the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer, each fully connected layer is a 12-dimensional input, using maxout Excitation function, and the third fully connected layer and the fourth fully connected layer adopt the discarding strategy for training.
  4. 如权利要求1至3任一项所述的基于短文本的声纹检测方法,其特征在于,所述将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果包括:The voiceprint detection method based on short text according to any one of claims 1 to 3, wherein the voiceprint vector of the voice signal is compared with a pre-stored voiceprint vector in a voiceprint model library , And output voiceprint detection results according to the comparison results, including:
    将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library;
    若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向 量对应的用户信息,输出所述用户信息;If there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, acquiring user information corresponding to the pre-stored voiceprint vector, and outputting the user information;
    若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。If there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, a prompt message indicating that the detection fails is output.
  5. 如权利要求1至3任一项所述的基于短文本的声纹检测方法,其特征在于,所述对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数包括:The voiceprint detection method based on short text according to any one of claims 1 to 3, wherein the preprocessing is performed on the voice signal to be recognized, and the preprocessed voice signal is performed Feature extraction, obtained Mel frequency cepstrum coefficients include:
    对所述待识别的语音信号的波形图执行分帧处理;Performing framing processing on the waveform diagram of the voice signal to be recognized;
    在分帧处理之后,对每一帧信号执行加窗处理;After framing processing, perform windowing processing on each frame of signal;
    对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱;Perform discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal;
    根据所有帧信号对应的频谱计算所述语音信号的功率谱;Calculating the power spectrum of the voice signal according to the frequency spectrum corresponding to all frame signals;
    根据所述功率谱计算梅尔滤波器组;Calculating a mel filter bank according to the power spectrum;
    对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量;Perform logarithmic operation on the output of each mel filter to obtain logarithmic energy;
    对所述对数能量执行离散余弦变换,得到所述语音信号的梅尔频率倒谱系数。The discrete cosine transform is performed on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the speech signal.
  6. 一种基于短文本的声纹检测装置,其特征在于,包括:A voiceprint detection device based on short text, which is characterized in that it comprises:
    训练模块,用于获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;The training module is used to obtain training samples, and use the training samples to train a preset deep neural network;
    信号获取模块,用于获取待识别的语音信号;The signal acquisition module is used to acquire the voice signal to be recognized;
    特征提取模块,用于对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;The feature extraction module is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
    特征获取模块,用于将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The feature acquisition module is used to input the Mel frequency cepstrum coefficients into a pre-trained deep neural network, and acquire the output vector of the deep neural network in the last fully connected layer as the voice signal A voiceprint vector, where each element in the voiceprint vector represents a feature of the voice signal;
    检测模块,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;The detection module is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
    其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
  7. 如权利要求6所述的基于短文本的声纹检测装置,其特征在于,所述训练模块包括:The voiceprint detection device based on short text according to claim 6, wherein the training module comprises:
    样本获取单元,用于获取多个用户的语音样本作为训练样本;The sample acquisition unit is used to acquire voice samples of multiple users as training samples;
    特征提取单元,用于对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;The feature extraction unit is configured to preprocess the training samples of each user, and perform feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
    标签单元,用于对每一个所述用户的梅尔频率倒谱系数打上用户标签;The tag unit is used to tag the Mel frequency cepstrum coefficient of each user with a user tag;
    训练单元,用于将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The training unit is used to input the Mel frequency cepstrum coefficients with user tags as input vectors into the preset deep neural network for training;
    参数修改单元,用于采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;The parameter modification unit is used to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag using a preset loss function, and modify the Parameters of deep neural network;
    所述训练单元还用于,将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值,停止迭代。The training unit is also used to pass the Mel frequency cepstrum coefficients with user labels as an input vector to the modified deep neural network for the next iterative training, until the deep neural network performs the next iteration of training for each Mel frequency The accuracy of the recognition result of the cepstral coefficient reaches the specified threshold, and the iteration is stopped.
  8. 如权利要求7所述的基于短文本的声纹检测装置,其特征在于,所述深度神经网络包括输入层、四层全连接层以及输出层,每一全连接层为12维输入,采用maxout激发函数,且第三全连接层和第四全连接层采用丢弃策略进行训练。The voiceprint detection device based on short text according to claim 7, wherein the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer, each fully connected layer is a 12-dimensional input, using maxout Excitation function, and the third fully connected layer and the fourth fully connected layer adopt the discarding strategy for training.
  9. 如权利要求6至8任一项所述的基于短文本的声纹检测装置,其特征在于,所述检测模块包括:The voiceprint detection device based on short text according to any one of claims 6 to 8, wherein the detection module comprises:
    比对单元,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对;The comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library;
    第一结果输出单元,用于若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向量对应的用户信息,输出所述用户信息;The first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
    第二结果输出单元,用于若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。The second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
  10. 如权利要求6至8任一项所述的基于短文本的声纹检测装置,其特征在于,所述特征提取模块包括:The voiceprint detection device based on short text according to any one of claims 6 to 8, wherein the feature extraction module comprises:
    分帧单元,用于对所述待识别的语音信号的波形图执行分帧处理;The framing unit is configured to perform framing processing on the waveform diagram of the voice signal to be recognized;
    加窗单元,用于在分帧处理之后,对每一帧信号执行加窗处理;The windowing unit is used to perform windowing processing on each frame of signal after framing processing;
    变换单元,用于对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱;A transforming unit for performing discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal;
    功率谱计算单元,用于根据所有帧信号对应的频谱计算所述语音信号的功率谱;A power spectrum calculation unit, configured to calculate the power spectrum of the voice signal according to the spectrum corresponding to all frame signals;
    滤波器组计算单元,用于根据所述功率谱计算梅尔滤波器组;A filter bank calculation unit for calculating a mel filter bank according to the power spectrum;
    对数单元,用于对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量;Logarithmic unit, used to perform logarithmic operation on the output of each mel filter to obtain logarithmic energy;
    余弦变换单元,用于对所述对数能量执行离散余弦变换,得到所述语音信号的梅尔频率倒谱系数。The cosine transform unit is configured to perform discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the voice signal.
  11. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor executes the computer-readable instructions as follows step:
    获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
    获取待识别的语音信号;Obtain the voice signal to be recognized;
    对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
    将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后 一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The Mel frequency cepstral coefficients are passed into a pre-trained deep neural network as input, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal. Each element in the voiceprint vector represents the feature of the voice signal;
    将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
    其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
  12. 如权利要求11所述的计算机设备,其特征在于,所述获取训练样本,采用所述训练样本对预设的深度神经网络进行训练包括:The computer device according to claim 11, wherein said obtaining training samples and using said training samples to train a preset deep neural network comprises:
    获取多个用户的语音样本作为训练样本;Acquire voice samples of multiple users as training samples;
    对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;Preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
    对每一个所述用户的梅尔频率倒谱系数打上用户标签;Labeling a user tag on the Mel frequency cepstrum coefficient of each user;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The Mel frequency cepstrum coefficients with user labels are used as input vectors to the preset deep neural network for training;
    采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;Using a preset loss function to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag, and modify the parameters of the deep neural network according to the error;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值,停止迭代。The mel frequency cepstral coefficients with user labels are used as input vectors to pass into the modified deep neural network for the next iterative training, until the deep neural network has an accurate recognition result of each mel frequency cepstral coefficient If the rate reaches the specified threshold, stop iteration.
13. The computer device according to claim 12, wherein the deep neural network comprises an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
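A sketch of how the network described in claim 13 could be assembled in PyTorch. The number of maxout pieces, the dropout probability and the exact placement of dropout are assumptions; the claim only fixes four fully connected layers with 12-dimensional inputs, the maxout activation, and dropout training on the third and fourth fully connected layers.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: a linear map into several pieces followed by an element-wise max."""
    def __init__(self, in_dim, out_dim, pieces=2):
        super().__init__()
        self.out_dim, self.pieces = out_dim, pieces
        self.linear = nn.Linear(in_dim, out_dim * pieces)

    def forward(self, x):
        y = self.linear(x)
        return y.view(*x.shape[:-1], self.out_dim, self.pieces).max(dim=-1).values

class VoiceprintDNN(nn.Module):
    """Input layer, four fully connected (maxout) layers, output layer; dropout is
    applied to the third and fourth fully connected layers during training."""
    def __init__(self, num_users, dim=12, dropout_p=0.5):
        super().__init__()
        self.fc1, self.fc2 = Maxout(dim, dim), Maxout(dim, dim)
        self.fc3, self.fc4 = Maxout(dim, dim), Maxout(dim, dim)
        self.drop = nn.Dropout(dropout_p)
        self.out = nn.Linear(dim, num_users)        # output layer over enrolled users

    def last_fc_output(self, x):
        """Output of the fourth fully connected layer, used as the voiceprint vector."""
        x = self.fc2(self.fc1(x))
        return self.fc4(self.drop(self.fc3(x)))

    def forward(self, x):
        return self.out(self.drop(self.last_fc_output(x)))
```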
14. The computer device according to any one of claims 11 to 13, wherein comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result, comprises:
    comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
    if the voiceprint model library contains a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, acquiring the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
    if the voiceprint model library contains no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, outputting a prompt message indicating that the detection has failed.
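Read literally, the comparison step in claim 14 is a library lookup that returns either the matched user's information or a failure prompt. A minimal sketch follows, assuming a small numeric tolerance stands in for "identical" vectors:

```python
import numpy as np

def compare_with_library(voiceprint, library, tolerance=1e-6):
    """library: iterable of (user_info, pre-stored voiceprint vector) pairs."""
    for user_info, stored in library:
        # treat vectors as identical when they agree within the tolerance
        if np.allclose(voiceprint, stored, atol=tolerance):
            return {"result": "match", "user": user_info}
    # no identical pre-stored vector: report a detection failure
    return {"result": "failure", "message": "voiceprint detection failed"}
```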
15. The computer device according to any one of claims 11 to 13, wherein preprocessing the voice signal to be recognized and performing feature extraction on the preprocessed voice signal to obtain mel-frequency cepstral coefficients comprises:
    performing framing on the waveform of the voice signal to be recognized;
    after framing, performing windowing on each frame of the signal;
    performing a discrete Fourier transform on each windowed frame to obtain the frequency spectrum of that frame;
    calculating the power spectrum of the voice signal from the frequency spectra of all frames;
    calculating a mel filter bank according to the power spectrum;
    performing a logarithmic operation on the output of each mel filter to obtain log energy;
    performing a discrete cosine transform on the log energy to obtain the mel-frequency cepstral coefficients of the voice signal.
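The seven feature-extraction steps of claim 15 correspond to a standard MFCC computation. A numpy sketch is given below; the frame length, frame step, Hamming window, filter count and number of retained coefficients are assumptions, and `mel_filter_bank` is the helper sketched earlier.

```python
import numpy as np

def extract_mfcc(signal, sample_rate=16000, frame_len=400, frame_step=160,
                 nfft=512, num_filters=26, num_ceps=12):
    """Framing -> windowing -> DFT -> power spectrum -> mel filter bank -> log -> DCT."""
    # 1. framing: pad so the last frame is complete, then slice into overlapping frames
    num_frames = 1 + max(0, int(np.ceil((len(signal) - frame_len) / frame_step)))
    pad = (num_frames - 1) * frame_step + frame_len - len(signal)
    signal = np.append(signal, np.zeros(max(pad, 0)))
    frames = np.stack([signal[i * frame_step: i * frame_step + frame_len]
                       for i in range(num_frames)])
    # 2. windowing each frame (a Hamming window is assumed)
    frames = frames * np.hamming(frame_len)
    # 3. discrete Fourier transform per frame and 4. power spectrum
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft
    # 5. mel filter bank applied to the power spectrum
    filtered = power @ mel_filter_bank(num_filters, nfft, sample_rate).T
    # 6. logarithm of each filter's output
    log_energy = np.log(np.maximum(filtered, 1e-10))
    # 7. discrete cosine transform (type II), keeping the first num_ceps coefficients
    n = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1) / (2 * num_filters))
    return log_energy @ basis.T
```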
16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining training samples, and training a preset deep neural network with the training samples;
    obtaining a voice signal to be recognized;
    preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain mel-frequency cepstral coefficients;
    feeding the mel-frequency cepstral coefficients as input into the pre-trained deep neural network, and taking the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal, where each element of the voiceprint vector represents a feature of the voice signal;
    comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
    wherein both the training samples and the voice signal are short texts.
17. The non-volatile readable storage media according to claim 16, wherein obtaining training samples and training a preset deep neural network with the training samples comprises:
    acquiring voice samples of multiple users as training samples;
    preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain mel-frequency cepstral coefficients;
    attaching a user label to the mel-frequency cepstral coefficients of each user;
    feeding the labeled mel-frequency cepstral coefficients as input vectors into the preset deep neural network for training;
    using a preset loss function to calculate the error between the recognition result produced by the deep neural network for each set of mel-frequency cepstral coefficients and the corresponding user label, and modifying the parameters of the deep neural network according to the error;
    feeding the labeled mel-frequency cepstral coefficients as input vectors into the parameter-modified deep neural network for the next training iteration, and stopping the iteration once the accuracy of the deep neural network's recognition results for the mel-frequency cepstral coefficients reaches a specified threshold.
18. The non-volatile readable storage media according to claim 17, wherein the deep neural network comprises an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
19. The non-volatile readable storage media according to any one of claims 16 to 18, wherein comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result, comprises:
    comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
    if the voiceprint model library contains a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, acquiring the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
    if the voiceprint model library contains no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, outputting a prompt message indicating that the detection has failed.
20. The non-volatile readable storage media according to any one of claims 16 to 18, wherein preprocessing the voice signal to be recognized and performing feature extraction on the preprocessed voice signal to obtain mel-frequency cepstral coefficients comprises:
    performing framing on the waveform of the voice signal to be recognized;
    after framing, performing windowing on each frame of the signal;
    performing a discrete Fourier transform on each windowed frame to obtain the frequency spectrum of that frame;
    calculating the power spectrum of the voice signal from the frequency spectra of all frames;
    calculating a mel filter bank according to the power spectrum;
    performing a logarithmic operation on the output of each mel filter to obtain log energy;
    performing a discrete cosine transform on the log energy to obtain the mel-frequency cepstral coefficients of the voice signal.
PCT/CN2019/117731 2019-03-06 2019-11-13 Voiceprint detection method, apparatus and device based on short text, and storage medium WO2020177380A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910167882.3A CN110010133A (en) 2019-03-06 2019-03-06 Vocal print detection method, device, equipment and storage medium based on short text
CN201910167882.3 2019-03-06

Publications (1)

Publication Number Publication Date
WO2020177380A1 (en) 2020-09-10

Family

ID=67166562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117731 WO2020177380A1 (en) 2019-03-06 2019-11-13 Voiceprint detection method, apparatus and device based on short text, and storage medium

Country Status (2)

Country Link
CN (1) CN110010133A (en)
WO (1) WO2020177380A1 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text
CN110751944A (en) * 2019-09-19 2020-02-04 平安科技(深圳)有限公司 Method, device, equipment and storage medium for constructing voice recognition model
CN110570871A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, device and equipment
CN110767239A (en) * 2019-09-20 2020-02-07 平安科技(深圳)有限公司 Voiceprint recognition method, device and equipment based on deep learning
CN110880327A (en) * 2019-10-29 2020-03-13 平安科技(深圳)有限公司 Audio signal processing method and device
CN110875043B (en) * 2019-11-11 2022-06-17 广州国音智能科技有限公司 Voiceprint recognition method and device, mobile terminal and computer readable storage medium
CN111128234B (en) * 2019-12-05 2023-02-14 厦门快商通科技股份有限公司 Spliced voice recognition detection method, device and equipment
CN111145736B (en) * 2019-12-09 2022-10-04 华为技术有限公司 Speech recognition method and related equipment
CN111462757B (en) * 2020-01-15 2024-02-23 北京远鉴信息技术有限公司 Voice signal-based data processing method, device, terminal and storage medium
CN113223536B (en) * 2020-01-19 2024-04-19 Tcl科技集团股份有限公司 Voiceprint recognition method and device and terminal equipment
CN111227839B (en) * 2020-01-19 2023-08-18 中国电子科技集团公司电子科学研究院 Behavior recognition method and device
CN111326161B (en) * 2020-02-26 2023-06-30 北京声智科技有限公司 Voiceprint determining method and device
CN111341320B (en) * 2020-02-28 2023-04-14 中国工商银行股份有限公司 Phrase voice voiceprint recognition method and device
CN111341307A (en) * 2020-03-13 2020-06-26 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113470653A (en) * 2020-03-31 2021-10-01 华为技术有限公司 Voiceprint recognition method, electronic equipment and system
CN111583935A (en) * 2020-04-02 2020-08-25 深圳壹账通智能科技有限公司 Loan intelligent delivery method, device and storage medium
CN111326163B (en) * 2020-04-15 2023-02-14 厦门快商通科技股份有限公司 Voiceprint recognition method, device and equipment
CN111524522B (en) * 2020-04-23 2023-04-07 上海依图网络科技有限公司 Voiceprint recognition method and system based on fusion of multiple voice features
CN111488947B (en) * 2020-04-28 2024-02-02 深圳力维智联技术有限公司 Fault detection method and device for power system equipment
CN112185347A (en) * 2020-09-27 2021-01-05 北京达佳互联信息技术有限公司 Language identification method, language identification device, server and storage medium
CN112242137A (en) * 2020-10-15 2021-01-19 上海依图网络科技有限公司 Training of human voice separation model and human voice separation method and device
CN112259114A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voice processing method and device, computer storage medium and electronic equipment
CN112071322B (en) * 2020-10-30 2022-01-25 北京快鱼电子股份公司 End-to-end voiceprint recognition method, device, storage medium and equipment
CN112562691A (en) * 2020-11-27 2021-03-26 平安科技(深圳)有限公司 Voiceprint recognition method and device, computer equipment and storage medium
CN112562656A (en) * 2020-12-16 2021-03-26 咪咕文化科技有限公司 Signal classification method, device, equipment and storage medium
CN112802481A (en) * 2021-04-06 2021-05-14 北京远鉴信息技术有限公司 Voiceprint verification method, voiceprint recognition model training method, device and equipment
CN113407768B (en) * 2021-06-24 2024-02-02 深圳市声扬科技有限公司 Voiceprint retrieval method, voiceprint retrieval device, voiceprint retrieval system, voiceprint retrieval server and storage medium
CN114003885B (en) * 2021-11-01 2022-08-26 浙江大学 Intelligent voice authentication method, system and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185379B (en) * 2015-06-17 2017-08-18 百度在线网络技术(北京)有限公司 voiceprint authentication method and device
CN105869644A (en) * 2016-05-25 2016-08-17 百度在线网络技术(北京)有限公司 Deep learning based voiceprint authentication method and device
WO2019023877A1 (en) * 2017-07-31 2019-02-07 深圳和而泰智能家居科技有限公司 Specific sound recognition method and device, and storage medium
CN108417217B (en) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 Speaker recognition network model training method, speaker recognition method and system
CN108877812B (en) * 2018-08-16 2021-04-02 桂林电子科技大学 Voiceprint recognition method and device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150301796A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Speaker verification
CN105788592A (en) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 Audio classification method and apparatus thereof
CN107808664A (en) * 2016-08-30 2018-03-16 富士通株式会社 Audio recognition method, speech recognition equipment and electronic equipment based on sparse neural network
CN107610707A (en) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic installation, the method for authentication and computer-readable recording medium
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何建超 (HE, Jianchao): "基于高层信息融合的短语音说话人识别方法研究 (The Research of High-Level Information Fusion Based Speaker Recognition Algorithm Using Short Utterance)", Chinese Master's Theses Full-Text Database (Electronic Journals), 15 April 2017 (2017-04-15), XP055732004, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN110010133A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
US20200321008A1 (en) Voiceprint recognition method and device based on memory bottleneck feature
Liu et al. An MFCC‐based text‐independent speaker identification system for access control
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2020224114A1 (en) Residual delay network-based speaker confirmation method and apparatus, device and medium
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
CN110378228A (en) Video data handling procedure, device, computer equipment and storage medium are examined in face
CN109346086A (en) Method for recognizing sound-groove, device, computer equipment and computer readable storage medium
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2021042537A1 (en) Voice recognition authentication method and system
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
Dawood et al. A robust voice spoofing detection system using novel CLS-LBP features and LSTM
Ismail et al. Development of a regional voice dataset and speaker classification based on machine learning
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN113869212A (en) Multi-modal in-vivo detection method and device, computer equipment and storage medium
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
Kuznetsov et al. Methods of countering speech synthesis attacks on voice biometric systems in banking
Tai et al. Seef-aldr: A speaker embedding enhancement framework via adversarial learning based disentangled representation
Khanum et al. A novel speaker identification system using feed forward neural networks
Nguyen et al. Vietnamese speaker authentication using deep models
CN113178196B (en) Audio data extraction method and device, computer equipment and storage medium
Al-karawi Real-time adaptive training for forensic speaker verification in reverberation conditions
Revathi et al. Real time implementation of voice based robust person authentication using TF features and CNN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19918346; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19918346; Country of ref document: EP; Kind code of ref document: A1