WO2021164256A1 - Voice signal processing method, apparatus and device

Voice signal processing method, apparatus and device

Info

Publication number
WO2021164256A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
voice signal
statistical
feature vector
sample
Prior art date
Application number
PCT/CN2020/118120
Other languages
English (en)
French (fr)
Inventor
王健宗
彭俊清
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021164256A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Definitions

  • This application relates to the field of signal processing, and in particular to methods, devices and equipment for processing voice signals.
  • Voiceprint recognition technology has been widely used in the field of remote, unsupervised identity authentication.
  • A recording replay attack uses a high-fidelity recording device to record the voice of the target person and then uses the recorded voice signal to crack the voiceprint authentication system. Because the replayed voice comes from the speaker himself, it sounds more authentic, and this kind of attack therefore poses a greater threat to the security of the system.
  • In some schemes, the system specifies the text sentences that users need to read, and recording replay detection is supplemented by voice content recognition.
  • However, the inventor found that when a user has a heavy accent or special pronunciation habits, the accuracy of speech content recognition drops sharply, which in turn reduces the accuracy of detecting recorded and replayed speech signals.
  • Accordingly, embodiments of this application provide a speech signal processing method, device, equipment, and computer storage medium that can improve the accuracy of replayed-signal detection without needing to recognize the content of the voice signal, which also improves detection efficiency.
  • An embodiment of the present application provides a voice signal processing method, including: obtaining a first statistical feature vector corresponding to a voice signal to be processed, where the first statistical feature vector is used to indicate the statistical value of each dimension of the feature space of the voice signal to be processed in an M-dimensional feature space, where M is an integer greater than 1; inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, where the first model is used to process the first statistical feature vector according to the importance of each dimension of the feature space in the M-dimensional feature space; and determining a target category of the voice signal to be processed according to the second statistical feature vector, the target category including an original voice signal or a recorded and replayed voice signal.
  • An embodiment of the present application provides a voice signal processing device, including: a first feature acquisition module, configured to acquire a first statistical feature vector corresponding to the voice signal to be processed, where the first statistical feature vector is used to represent the statistical value of each dimension of the feature space of the speech signal to be processed in an M-dimensional feature space, where M is an integer greater than 1; a second feature acquisition module, configured to input the first statistical feature vector into the first model for processing to obtain a second statistical feature vector, where the first model is used to process the first statistical feature vector according to the importance of each dimension of the feature space in the M-dimensional feature space; and a target category determination module, configured to determine a target category of the voice signal to be processed according to the second statistical feature vector, where the target category includes an original voice signal or a recorded and replayed voice signal.
  • An embodiment of the present application provides a voice signal processing device, including a processor, a memory, and an input and output interface.
  • The processor, the memory, and the input and output interface are connected to each other, where the input and output interface is used to input or output data, the memory is used to store the application program code with which the voice signal processing device executes the above method, and the processor is configured to perform the following steps: acquiring the first statistical feature vector corresponding to the voice signal to be processed, where the first statistical feature vector is used to represent the statistical value of each dimension of the feature space of the speech signal to be processed in an M-dimensional feature space, where M is an integer greater than 1; inputting the first statistical feature vector into the first model for processing to obtain a second statistical feature vector, where the first model is used to process the first statistical feature vector according to the importance of each dimension of the feature space in the M-dimensional feature space; and determining a target category of the voice signal to be processed according to the second statistical feature vector, where the target category includes an original voice signal or a recorded and replayed voice signal.
  • An embodiment of the present application provides a computer storage medium that stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute the following steps: obtaining a first statistical feature vector corresponding to the voice signal to be processed, where the first statistical feature vector is used to represent the statistical value of each dimension of the feature space of the voice signal to be processed in an M-dimensional feature space, where M is an integer greater than 1; inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, where the first model is used to process the first statistical feature vector according to the importance of each dimension of the feature space in the M-dimensional feature space; and determining the target category of the voice signal to be processed according to the second statistical feature vector, where the target category includes an original voice signal or a recorded and replayed voice signal.
  • In the embodiments of this application, the first statistical feature vector corresponding to the speech signal to be processed is obtained, the first statistical feature vector is input into the first model for processing to obtain the second statistical feature vector, and the target category of the speech signal to be processed is determined according to the second statistical feature vector, so as to determine whether the speech signal to be processed is an original speech signal or a recorded and replayed speech signal. Because the first statistical feature vector is processed according to the importance of each dimension of the feature space in the M-dimensional feature space, the statistical features of each dimension of the feature space are strengthened and can more accurately reflect the statistical features of the voice signal to be processed. The target category of the voice signal to be processed can therefore be determined accurately, which improves the accuracy of recording and replay detection; the content of the voice signal does not need to be recognized, which improves detection efficiency and gives the method strong applicability.
  • FIG. 1 is a schematic flowchart of a voice signal processing method provided by an embodiment of the present application.
  • Fig. 2 is a schematic flowchart of another voice signal processing method provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of training a first model provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another voice signal processing method provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a training coding model and a decoding model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the composition structure of a voice signal processing device provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the composition structure of a voice signal processing device provided by an embodiment of the present application.
  • the solution of the embodiment of the present application is suitable for processing the voice signal to determine whether the target category of the voice signal belongs to the recording and replaying voice signal category.
  • The first statistical feature vector is input into the first model for processing to obtain the second statistical feature vector, and the target category of the voice signal to be processed is determined according to the second statistical feature vector, so as to determine whether the voice signal to be processed is an original voice signal or a recorded and replayed voice signal. Because the first statistical feature vector is processed according to the importance of each dimension of the feature space in the M-dimensional feature space, the statistical features of each dimension of the feature space are strengthened and can more accurately reflect the statistical features of the voice signal to be processed. The target category of the voice signal to be processed can therefore be determined accurately, which improves the accuracy of recording and replay detection; the content of the voice signal does not need to be recognized, which improves detection efficiency and gives the method strong applicability.
  • Fig. 1 is a schematic flowchart of a voice signal processing method provided by an embodiment of the present application. As shown in the figure, the method includes S101-S103.
  • S101 Obtain a first statistical feature vector corresponding to the voice signal to be processed, where the first statistical feature vector is used to represent the statistical value of each dimension of the feature space of the voice signal to be processed in the M-dimensional feature space, and M is an integer greater than 1.
  • the embodiments of this application can be applied to a voiceprint recognition and authentication system, that is, the user’s identity is determined by detecting the voiceprint of a voice signal to be processed.
  • The voice signal to be processed may be a voice signal submitted for voiceprint recognition. Because a recorded and replayed voice signal of a user could otherwise pass voiceprint recognition as that user, the embodiment of the present application needs to perform recording and replay detection on the voice signal to be processed.
  • the voice signal to be processed may be an original voice signal or a recording and replaying voice signal
  • The original voice signal may be a voice signal generated by the user directly vocalizing (for example, speaking), that is, a voice signal that has not been recorded and replayed through equipment such as recording devices.
  • the recording and replaying voice signal may include the voice signal obtained by recording the voice signal generated by the user's direct utterance, or the voice signal synthesized by means of signal synthesis and other methods that are not generated by the user's direct utterance, and so on.
  • all voice signals other than the original voice signal are called recording and replaying voice signals.
  • the first statistical feature vector includes a first mean vector and/or a first standard deviation vector.
  • The first mean vector is used to represent the mean value of each dimension of the feature space in the M-dimensional feature space of the speech signal to be processed, and the first standard deviation vector is used to represent the standard deviation of each dimension of the feature space in the M-dimensional feature space of the speech signal to be processed.
  • specifically acquiring the first statistical feature vector corresponding to the voice signal to be processed may include the following steps 1 to 4.
  • The voice signal to be processed can be sampled with a preset sampling period to transform the continuous voice signal into a discretized voice signal.
  • The sampling period can be a period determined according to the Nyquist sampling theorem.
  • The discretized signal can then be pre-emphasized, for example as y(n) = x(n) − α·x(n−1), where α is the pre-emphasis coefficient and α is greater than 0.9 and less than 1.
  • Finally, a window function can be used to perform framing processing on the discrete speech signal to obtain multiple speech frames; here N speech frames are obtained, where the window function can be any one of a rectangular window, a Hamming window, or a Hanning window.
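The sampling, pre-emphasis, framing, and windowing steps above can be sketched in a few lines of NumPy. The frame length, hop size, and pre-emphasis coefficient below are illustrative values, not taken from the application:

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame_len=400, hop=160):
    """Pre-emphasize and frame a discretized speech signal.

    alpha, frame_len and hop are illustrative; the application only
    requires 0.9 < alpha < 1 and some window function.
    """
    # pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # split into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # ~1 s of audio at 16 kHz
print(frames.shape)  # (98, 400): N speech frames of 400 samples each
```

Each row of the result is one windowed speech frame ready for feature extraction.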
  • endpoint detection can be performed by means such as energy-based endpoint detection, information entropy-based endpoint detection, or frequency band variance-based endpoint detection.
  • For example, the first feature vector of each speech frame may be a feature vector with a 400-dimensional feature space; that is, the first feature vector represents the eigenvalues of the speech frame in each dimension of the 400-dimensional feature space, so that 100 first feature vectors of 400 dimensions are obtained.
  • linear prediction cepstral coefficients (LPCC)
  • Mel-frequency cepstral coefficients (MFCC)
  • constant Q cepstral coefficients (CQCC)
  • To extract CQCC features, the voice signal corresponding to each of the N voice frames can first be subjected to a constant Q transform (CQT) to transform the time-domain signal into a frequency-domain signal; secondly, the energy spectrum of each of the N speech frames is calculated, and the logarithm of the energy spectrum is taken to obtain the logarithmic energy spectrum; finally, the logarithmic energy spectrum is uniformly resampled to obtain a sampling function, and a discrete cosine transform (DCT) is performed on the sampling function to obtain the CQCC feature vector, that is, the first feature vector. From this, the first feature vector of each of the N voice frames can be obtained.
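The CQT → log energy → resampling → DCT pipeline above can be sketched for a single frame as follows. This is a deliberately naive, dependency-free sketch; the frequency range, bin count, and number of cepstral coefficients are illustrative assumptions, not values from the application:

```python
import numpy as np

def cqcc_frame(frame, sr=16000, fmin=55.0, bins=48, bpo=12, n_ceps=20):
    """Toy CQCC extraction for one frame: naive CQT -> log energy
    -> uniform resampling -> DCT.  All parameters are illustrative."""
    freqs = fmin * 2.0 ** (np.arange(bins) / bpo)   # geometrically spaced bins
    q = 1.0 / (2.0 ** (1.0 / bpo) - 1.0)            # constant Q factor
    spec = np.empty(bins)
    for k, f in enumerate(freqs):
        n = min(len(frame), max(8, int(round(q * sr / f))))
        t = np.arange(n)
        kernel = np.hanning(n) * np.exp(-2j * np.pi * f * t / sr) / n
        spec[k] = np.abs(np.dot(frame[:n], kernel)) ** 2   # energy per bin
    log_e = np.log(spec + 1e-10)                    # logarithmic energy spectrum
    # uniform resampling of the log spectrum over the frequency axis
    uniform_f = np.linspace(freqs[0], freqs[-1], bins)
    resampled = np.interp(uniform_f, freqs, log_e)
    # DCT-II of the resampled log spectrum -> cepstral coefficients
    idx = np.arange(bins)
    basis = np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * bins))
    return (basis @ resampled)[:n_ceps]

coef = cqcc_frame(np.random.randn(400))
print(coef.shape)  # (20,): one CQCC first feature vector for the frame
```

A production system would use an efficient CQT implementation; the loop here only mirrors the order of operations described in the text.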
  • For each dimension of the feature space, the statistical value corresponding to that dimension is calculated, that is, the statistical value of the N speech frames in that dimension of the feature space. For example, if M is 400 and N is 100, then for each dimension of the 400-dimensional feature space, the statistical value of the 100 speech frames in that dimension is calculated.
  • the statistical value may include the mean value and/or standard deviation, that is, for each dimensional feature space in the M-dimensional feature space, calculate the mean value of the N speech frames in the dimensional feature space to obtain the M-dimensional mean vector; for the M-dimensional feature space For each dimension feature space in, calculate the standard deviation of N speech frames in this dimension feature space to obtain an M-dimensional standard deviation vector. For example, for each dimensional feature space in a 400-dimensional feature space, calculate the mean value and/or standard deviation of 100 speech frames in the dimensional feature space to obtain a 400-dimensional mean vector and/or 400-dimensional standard deviation vector.
  • When the statistical value includes both the mean value and the standard deviation, the first statistical feature vector corresponding to the speech signal to be processed is constructed as follows: according to the mean value corresponding to each dimension of the feature space in the M-dimensional feature space, the first mean vector corresponding to the speech signal to be processed is constructed, and according to the standard deviation corresponding to each dimension of the feature space in the M-dimensional feature space, the first standard deviation vector corresponding to the speech signal to be processed is constructed.
  • The first mean vector is a vector with an M-dimensional feature space composed of M mean values, and the first standard deviation vector is a vector with an M-dimensional feature space composed of M standard deviations.
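The construction of the first statistical feature vector reduces to per-dimension statistics over the N first feature vectors. A minimal sketch, using the running example of N = 100 frames and M = 400 dimensions:

```python
import numpy as np

# The N first feature vectors form an (N, M) matrix: one M-dimensional
# feature vector per speech frame (here N=100, M=400, random stand-ins).
features = np.random.randn(100, 400)

# Statistical value of the N frames in each dimension of the feature space:
first_mean_vector = features.mean(axis=0)  # M-dimensional first mean vector
first_std_vector = features.std(axis=0)    # M-dimensional first standard deviation vector

# The first statistical feature vector contains one or both of these.
print(first_mean_vector.shape, first_std_vector.shape)  # (400,) (400,)
```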
  • S102 Input the first statistical feature vector into the first model for processing to obtain a second statistical feature vector.
  • the first model is used to process the first statistical feature vector according to the importance of each dimension of the feature space in the M-dimensional feature space.
  • the first statistical feature vector can be processed by the weight module in the first model to obtain the second statistical feature vector.
  • the weight module includes a target weight matrix.
  • The target weight matrix can be a matrix with an M-dimensional feature space, where the value corresponding to each dimension of the feature space is used to indicate the importance of that dimension: the larger the value corresponding to a dimension of the feature space, the higher its importance; the smaller the value, the lower its importance.
  • The target weight matrix can be obtained by assigning a weight to each dimension of the feature space according to a target rule based on the mean value of each dimension of the first mean vector; that is, the target weight matrix is a matrix with an M-dimensional feature space.
  • The target rule can be: if the mean value of a certain dimension of the first mean vector is large, the weight of that dimension of the feature space is large; if the mean value of a certain dimension is small, the weight of that dimension is small. In other words, in the M-dimensional feature space of the first mean vector, a dimension with a larger mean value has a larger weight, and a dimension with a smaller mean value has a smaller weight.
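One way to realize such a target rule is sketched below. The softmax-style normalization is an illustrative assumption (the application only requires that larger means get larger weights), and the diagonal form of the target weight matrix is likewise assumed:

```python
import numpy as np

M = 400
first_mean = np.abs(np.random.randn(M))  # stand-in first mean vector
first_std = np.abs(np.random.randn(M))   # stand-in first standard deviation vector

# Target rule: weight each dimension in proportion to its mean value
# (larger mean -> larger weight).  Softmax normalization is one choice.
weights = np.exp(first_mean) / np.exp(first_mean).sum()
target_weight_matrix = np.diag(weights)  # M x M diagonal weight matrix

# Second statistical feature vector: product with the target weight matrix.
second_mean = first_mean @ target_weight_matrix
second_std = first_std @ target_weight_matrix
```

With a diagonal matrix this is simply an element-wise re-weighting that strengthens the important feature-space dimensions.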
  • the first model can be trained in advance, so that the second statistical feature vector processed by using the trained first model can more accurately represent the category of the speech signal to be processed, and specifically train the first model Refer to the description in the embodiment corresponding to FIG. 3 for the process, which is not described here too much.
  • S103 Determine a target category of the voice signal to be processed according to the second statistical feature vector, where the target category includes the original voice signal or the recorded and replayed voice signal.
  • the original voice signal may be a voice signal generated by the user's direct utterance (that is, a voice signal that has not been recorded and reproduced through equipment such as audio and video recording).
  • the first statistical feature vector includes the first mean vector and the first standard deviation vector
  • the second statistical feature vector includes the second mean vector and the second standard deviation vector
  • The second mean vector is obtained according to the first mean vector and the first model, and the second standard deviation vector is obtained according to the first standard deviation vector and the first model.
  • For example, the second mean vector may be the product of the first mean vector and the target weight matrix, and the second standard deviation vector may be the product of the first standard deviation vector and the target weight matrix.
  • a third statistical feature vector may be constructed first based on the second mean vector and the second standard deviation vector; then the target category of the voice signal to be processed can be determined based on the third statistical feature vector.
  • The third statistical feature vector can be obtained by concatenating the second mean vector and the second standard deviation vector. Since the second mean vector and the second standard deviation vector are each vectors with an M-dimensional feature space, the third statistical feature vector obtained by splicing is a vector with a 2M-dimensional feature space, that is, a 2M-dimensional feature vector.
  • the third statistical feature vector may be reduced by the dimensionality reduction module to obtain a two-dimensional feature vector, thereby determining the target category of the voice signal to be processed according to the two-dimensional feature vector.
  • The correspondence between two-dimensional feature vectors and voice signal categories can be set in advance. Then, when the third statistical feature vector is reduced in dimensionality to obtain the two-dimensional feature vector, the voice signal category corresponding to the obtained two-dimensional feature vector is determined according to this correspondence, thereby determining the target category of the voice signal to be processed.
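The concatenation, dimensionality reduction, and category lookup described above can be sketched as follows. The single fully connected layer with random stand-in weights and the argmax decision rule are illustrative assumptions; in the application the layer's parameters come from training:

```python
import numpy as np

M = 400
rng = np.random.default_rng(0)
second_mean = rng.standard_normal(M)
second_std = rng.standard_normal(M)

# Third statistical feature vector: concatenation -> 2M dimensions.
third_vec = np.concatenate([second_mean, second_std])  # shape (800,)

# Dimensionality reduction module as one fully connected layer (2M -> 2);
# random weights stand in for trained parameters.
W = rng.standard_normal((2, 2 * M)) * 0.01
b = np.zeros(2)
two_dim = W @ third_vec + b                            # two-dimensional feature vector

# Pre-set correspondence between the 2-D vector and the signal categories
# (here: argmax over the two components, an assumed decision rule).
categories = ["original voice signal", "recording and replaying voice signal"]
target_category = categories[int(np.argmax(two_dim))]
print(third_vec.shape, two_dim.shape)
```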
  • In the embodiments of this application, the first statistical feature vector corresponding to the speech signal to be processed is obtained, the first statistical feature vector is input into the first model for processing to obtain the second statistical feature vector, and the target category of the speech signal to be processed is determined according to the second statistical feature vector, so as to determine whether the speech signal to be processed is an original speech signal or a recorded and replayed speech signal. Because the first statistical feature vector is processed according to the importance of each dimension of the feature space in the M-dimensional feature space, the statistical features of each dimension of the feature space are strengthened and can more accurately reflect the statistical features of the voice signal to be processed. The target category of the voice signal to be processed can therefore be determined accurately, which improves the accuracy of recording and replay detection; the content of the voice signal does not need to be recognized, which improves detection efficiency and gives the method strong applicability.
  • Before the first statistical feature vector is input into the first model for processing (that is, before the first model is used), a large number of sample speech signals can be used to train the first model, and the first model can be adjusted according to the training loss value, so that the second statistical feature vector produced by the trained first model can more accurately represent the category of the voice signal to be processed.
  • Figure 2 is a schematic flowchart of another voice signal processing method provided by an embodiment of the present application. As shown in the figure, the method includes S201-S204.
  • the first sample speech signal is a speech signal prepared for training the first model.
  • the first sample voice signal may be obtained by recording the original voice signal, or may be obtained by recording the recording and replaying voice signal.
  • The target category of the first sample voice signal may be determined in advance; that is, before the first sample voice signal is input into the first model for processing, it is determined whether the first sample voice signal belongs to the original voice signal category or the recorded and replayed voice signal category.
  • The target category of each first sample voice signal may be recorded in advance. For example, if the target categories of first sample voice signal 1, first sample voice signal 2, and first sample voice signal 3 are original voice signal, original voice signal, and recording and replaying voice signal respectively, then the correspondences first sample voice signal 1 → original voice signal, first sample voice signal 2 → original voice signal, and first sample voice signal 3 → recording and replaying voice signal can be recorded.
  • The first sample statistical feature vector includes a first sample mean vector and/or a first sample standard deviation vector. The first sample mean vector is used to represent the mean value of each dimension of the feature space of the first sample speech signal in the M-dimensional feature space, and the first sample standard deviation vector is used to represent the standard deviation of each dimension of the feature space of the first sample speech signal in the M-dimensional feature space.
  • S202 Input the statistical feature vector of the first sample into the first model for processing, and obtain the statistical feature vector of the second sample.
  • the first sample statistical feature vector includes the first sample mean vector and the first sample standard deviation vector
  • the second sample statistical feature vector includes the second sample mean vector and the second sample standard deviation vector
  • The second sample mean vector is obtained based on the first sample mean vector and the first model, and the second sample standard deviation vector is obtained based on the first sample standard deviation vector and the first model.
  • FIG. 3 is a schematic diagram of training the first model provided by an embodiment of the present application. As shown in the figure: the first sample statistical feature vector corresponding to the first sample speech signal is obtained, the first sample statistical feature vector is input into the first model, and the weight module in the first model performs a weight calculation on the first sample statistical feature vector to obtain the second sample statistical feature vector.
  • A third sample statistical feature vector can also be obtained according to the second sample statistical feature vector, and the third sample statistical feature vector is reduced by the dimensionality reduction module to obtain a two-dimensional sample feature vector; the two-dimensional sample feature vector corresponds to a target category.
  • the weight module includes a target weight matrix.
  • the target weight matrix is used to express the importance of each dimension feature space in the M-dimensional feature space.
  • The first sample statistical feature vector is weighted by the target weight matrix, and the third sample statistical feature vector is then obtained from the weighted result;
  • The dimensionality reduction module may include a fully connected layer to reduce the amount of calculation in training the first model. For example, the obtained third sample statistical feature vector is a high-dimensional feature matrix (here, 2M-dimensional), and a two-dimensional low-dimensional feature matrix can be obtained through the dimensionality reduction module, which reduces the amount of calculation in model training.
  • S203 Calculate the first loss of the first model according to the statistical feature vector of the second sample.
  • Calculating the first loss of the first model according to the second sample statistical feature vector means calculating the first loss of the first model according to the second sample mean vector and the second sample standard deviation vector.
  • Specifically, the target category of the first sample speech signal is predetermined. The first sample statistical feature vector corresponding to the first sample speech signal is processed through the first model to obtain the second sample statistical feature vector, the third sample statistical feature vector is obtained according to the second sample statistical feature vector, and the third sample statistical feature vector is reduced by the dimensionality reduction module. The obtained two-dimensional sample feature vector corresponds to a target category, and the first loss of the first model is calculated according to the similarity between the predetermined target category of the first sample voice signal and the target category corresponding to the two-dimensional sample feature vector.
  • The first model can be adjusted by the gradient descent method; that is, the weight module in the first model can be adjusted, and the dimensionality reduction module can also be adjusted by the gradient descent method, so that the parameters of the model and of the dimensionality reduction module become more accurate, and the second statistical feature vector obtained after processing by the first model more accurately reflects the category of the first sample speech signal.
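A single gradient-descent update of this kind can be sketched for the dimensionality reduction module as follows. The softmax cross-entropy loss is an assumed concrete choice (the text only says the loss measures similarity between predicted and true categories), and the shapes and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 400

# One training example: a third sample statistical feature vector plus its
# predetermined target category (0 = original, 1 = recorded/replayed).
x = rng.standard_normal(2 * M)
label = 1

# Dimensionality reduction module: fully connected layer 2M -> 2.
W = rng.standard_normal((2, 2 * M)) * 0.01
b = np.zeros(2)

def step(W, b, x, label, lr=0.01):
    """One gradient-descent update with softmax cross-entropy loss."""
    logits = W @ x + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss = -np.log(probs[label])
    grad_logits = probs.copy()
    grad_logits[label] -= 1.0            # d(loss)/d(logits)
    W -= lr * np.outer(grad_logits, x)   # adjust parameters by gradient descent
    b -= lr * grad_logits
    return loss

losses = [step(W, b, x, label) for _ in range(20)]
print(losses[0] > losses[-1])  # True: the loss decreases on this example
```

Adjusting the weight module would follow the same pattern, backpropagating the same loss through the target weight matrix.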
  • In the embodiment of this application, a large number of sample voice signals are used to train the first model. According to the similarity between the predetermined target category of each sample voice signal and the target category obtained by processing that sample voice signal with the first model, the first loss of the first model is determined, and whether the first model is accurate is judged according to the first loss; if not, the first model is adjusted so that the second statistical feature vector produced by the trained first model represents the target category of the sample voice signal more accurately. Because a large number of sample voice signals are used to train the first model, the trained first model is more accurate, which makes the recording and replay detection results more accurate.
  • S301 Acquire a first voice signal, where the first voice signal is a recording and replaying voice signal.
  • S302 Obtain a second feature vector of the first voice signal, and input the second feature vector into the coding model for coding processing to obtain a fourth statistical feature vector, which is used to represent the statistical feature of the first voice signal.
  • the second feature vector may include an LPCC feature vector, an MFCC feature vector, or a CQCC feature vector
  • the second feature vector can be obtained by performing LPCC feature extraction, MFCC feature extraction, or CQCC feature extraction on the first speech signal.
  • the second feature vector is obtained by feature extraction of the first speech signal, and the second feature vector is input into the coding model for coding processing to obtain the fourth statistical feature vector, which includes the third mean vector and the third standard deviation vector.
  • the third mean vector is used to represent the mean value of each dimension of the first speech signal in the M-dimensional feature space.
  • the third standard deviation vector is used to represent the standard deviation of each dimensional feature space of the first speech signal in the M-dimensional feature space.
  • the target condition is that the similarity between the second speech signal and the first speech signal satisfies the similarity threshold.
  • the similarity threshold can be 80%, 90%, 95%, etc., that is, the first speech signal and the second speech signal are two speech signals with high similarity. In this way, a recording and replaying voice signal can be used to generate another recording and replaying voice signal with high similarity to it.
  • if there are X recording and replaying voice signals, 2X recording and replaying voice signals can be generated in the above manner, and the 2X recording and replaying voice signals are further used to train the first model.
  • constructing the first implicit vector based on the fourth statistical feature vector means constructing the first implicit vector based on the third mean vector and the third standard deviation vector, and inputting the first implicit vector into the decoding model for decoding processing to obtain the third feature vector.
  • the encoding model and the decoding model may be the encoding layer and the decoding layer in a Variational Autoencoder (VAE).
  • the encoding model and the decoding model can be trained in advance, which makes the trained coding model and decoding model more accurate, so that the obtained second speech signal is more similar to the first speech signal corresponding to the second feature vector input into the coding model.
  • FIG. 5 is a schematic diagram of training the coding model and the decoding model according to an embodiment of the present application.
  • for a specific method of obtaining the first sample feature vector corresponding to the second sample speech signal, reference may be made to the method of obtaining the second feature vector of the first speech signal in step S302, which will not be repeated here.
  • the third sample statistical feature vector includes the second sample mean vector and the second sample standard deviation vector.
  • the second sample mean vector is used to represent the mean value of each dimension of the second sample speech signal in the M-dimensional feature space.
  • the second sample standard deviation vector is used to represent the standard deviation of each dimensional feature space of the second sample speech signal in the M-dimensional feature space. That is, the first sample feature vector is input into the coding model for coding processing to obtain the second sample mean vector and the second sample standard deviation vector.
  • the first normal distribution function can be determined according to the second sample mean vector and the second sample standard deviation vector, and the second loss can then be determined according to the degree of coincidence between the first normal distribution function and the standard normal distribution function: the higher the degree of coincidence, the smaller the second loss; the lower the degree of coincidence, the larger the second loss. The second loss can be a divergence loss.
  • the degree of coincidence between the first normal distribution function and the standard normal distribution function is the degree of coincidence between the graph corresponding to the first normal distribution function and the graph corresponding to the standard normal distribution function on the coordinate axes.
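  • The "divergence loss" described above is, in variational autoencoder practice, usually computed as the closed-form KL divergence between the encoder's Gaussian and the standard normal distribution. A minimal stdlib sketch under that assumption (the function name is illustrative, not from the application):

```python
import math

def kl_divergence_loss(mean_vec, std_vec):
    """KL( N(mu, sigma^2) || N(0, 1) ) summed over dimensions.

    The closer the encoder's distribution is to the standard normal
    (higher 'coincidence'), the smaller this loss, matching the
    behavior described for the second loss.
    """
    loss = 0.0
    for mu, sigma in zip(mean_vec, std_vec):
        loss += 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))
    return loss

# A perfectly coincident distribution gives zero loss:
print(kl_divergence_loss([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

  • Any shift of the mean away from zero, or a standard deviation away from one, strictly increases this loss, which is the monotonic relationship the text describes.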
  • the first sample implicit vector can be obtained by multiplying a sample drawn from the standard normal distribution by the second sample standard deviation vector and adding the second sample mean vector; that is, the first sample implicit vector is the product of the standard normal sample and the second sample standard deviation vector, plus the second sample mean vector.
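  • The construction above is the reparameterization trick: a standard normal sample is scaled by the standard deviation vector and shifted by the mean vector, per dimension. A stdlib sketch (names are illustrative):

```python
import random

def build_implicit_vector(mean_vec, std_vec, seed=None):
    """First sample implicit vector: eps * sigma + mu per dimension,
    with eps drawn from a standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * sigma + mu
            for mu, sigma in zip(mean_vec, std_vec)]

z = build_implicit_vector([0.5, -1.0], [0.1, 0.2], seed=0)
print(len(z))  # one implicit-vector component per feature dimension
```

  • Sampling this way keeps the encoder's mean and standard deviation differentiable, which is what allows the gradient descent adjustment described below.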
  • the implicit vector of the first sample is input into the decoding model for decoding processing to obtain the feature vector of the second sample.
  • the second sample feature vector is a feature vector constructed after inputting the first sample feature vector into the encoding model and the decoding model.
  • the third loss can be determined according to the similarity between the first sample feature vector and the second sample feature vector, that is, the higher the similarity between the first sample feature vector and the second sample feature vector, the smaller the third loss; The lower the similarity between the feature vector of the first sample and the feature vector of the second sample, the greater the third loss, where the third loss may be a cross-entropy loss.
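  • The third loss can be sketched as an elementwise cross-entropy between the first sample feature vector and its reconstruction, assuming (the application does not specify this) that both vectors are scaled to the interval (0, 1):

```python
import math

def reconstruction_loss(original, reconstructed, eps=1e-12):
    """Third loss: cross-entropy between the input feature vector and
    the decoder's reconstruction. The (0, 1) scaling of the features
    is an assumption, not stated in the application."""
    loss = 0.0
    for p, q in zip(original, reconstructed):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss += -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))
    return loss

# Identical vectors give a lower loss than dissimilar ones:
a = [0.9, 0.1, 0.5]
print(reconstruction_loss(a, a) < reconstruction_loss(a, [0.1, 0.9, 0.5]))  # True
```

  • This matches the stated relationship: higher similarity between the two feature vectors yields a smaller third loss.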
  • the gradient descent method can be used to adjust the parameters in the encoding model.
  • the gradient descent method can also be used to adjust the parameters in the decoding model, so that the adjusted models are more accurate.
  • if the first sample voice signal used for training the first model is a recording and replaying voice signal, the first sample voice signal is the first voice signal or the second voice signal.
  • the first voice signal and the second voice signal are both recorded and reproduced voice signals
  • the first voice signal is the voice signal corresponding to the first sample feature vector input to the coding model
  • the second voice signal is the voice signal output from the decoding model
  • the recorded and replayed voice signal may include a transcribed voice signal
  • the transcribed voice signal is a voice signal obtained by transcribing the recorded and replayed voice signal through an encoding model and a decoding model, that is, the second voice signal.
  • the recording and replaying voice signals also include the transcribed voice signal; that is, when the statistical feature vector corresponding to the transcribed voice signal is input into the first model for processing, the target category obtained for the transcribed voice signal is the recording and replaying voice signal.
  • FIG. 6 is a schematic diagram of the composition structure of a voice signal processing device provided by an embodiment of the present application.
  • the device 60 includes: a first feature acquisition module 601 for acquiring a first statistical feature vector corresponding to a voice signal to be processed
  • the first statistical feature vector is used to represent the statistical value of each dimensional feature space of the speech signal to be processed in the M-dimensional feature space, and the M is an integer greater than 1.
  • the voice signal to be processed may be an original voice signal or a recording and replaying voice signal
  • the original voice signal may be a voice signal generated by the user directly vocalizing (for example, speaking), that is, a voice signal that has not been recorded and replayed through equipment such as recording devices
  • the recording and replaying voice signal may include the voice signal obtained by recording the voice signal generated by the user's direct utterance, or the voice signal synthesized by means of signal synthesis and other methods that are not generated by the user's direct utterance, and so on.
  • all voice signals other than the original voice signal are called recording and replaying voice signals.
  • the first statistical feature vector includes a first mean vector and/or a first standard deviation vector.
  • the first mean vector is used to represent the mean value of each dimension of the feature space in the M-dimensional feature space of the speech signal to be processed
  • the first standard deviation vector is used for Represents the standard deviation of each dimensional feature space in the M-dimensional feature space of the speech signal to be processed.
  • the second feature acquisition module 602 is configured to input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, and the first model is used to process the first statistical feature vector based on the importance of each dimensional feature space in the M-dimensional feature space.
  • the second feature acquisition module 602 may process the first statistical feature vector through the weight module in the first model to obtain the second statistical feature vector.
  • the weight module includes a target weight matrix.
  • the target weight matrix can be a matrix with an M-dimensional feature space, where the value corresponding to each dimensional feature space in the M-dimensional feature space is used to indicate the importance of that dimensional feature space: the larger the value corresponding to a feature space, the higher the importance of that dimensional feature space; the smaller the value, the lower the importance.
  • the target weight matrix can be obtained by assigning weights to each dimensional feature space according to the target rule, based on the mean value of each dimensional feature space of the first mean vector in the M-dimensional feature space; that is, the target weight matrix is a matrix with an M-dimensional feature space.
  • the target rule can be: if the mean value of a certain dimension in the M-dimensional feature space of the first mean vector is large, the weight of that dimensional feature space is large; if the mean value of a certain dimension is small, the weight of that dimensional feature space is small. That is, in the M-dimensional feature space of the first mean vector, a feature space with a larger mean value has a larger weight, and a feature space with a smaller mean value has a smaller weight.
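  • The target rule can be sketched by making each dimension's weight proportional to its mean value; normalizing by the sum is one plausible choice, since the application does not fix the exact formula:

```python
def target_weights(mean_vec):
    """Assign each dimension a weight that grows with its mean value
    (the target rule above). Normalizing so the weights sum to 1 is
    an illustrative choice, not mandated by the application."""
    total = sum(abs(m) for m in mean_vec) or 1.0
    return [abs(m) / total for m in mean_vec]

w = target_weights([4.0, 1.0, 3.0, 2.0])
print(w)  # [0.4, 0.1, 0.3, 0.2] -- larger mean, larger weight
```

  • The resulting weights would form the diagonal of the target weight matrix, so that multiplying a statistical feature vector by the matrix scales each dimension by its importance.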
  • the first model can be trained in advance, so that the second statistical feature vector processed by using the trained first model can more accurately represent the category of the speech signal to be processed, and specifically train the first model Refer to the description in the embodiment corresponding to FIG. 3 for the process, which is not described here too much.
  • the target category determining module 603 is configured to determine the target category of the voice signal to be processed according to the second statistical feature vector, where the target category includes an original voice signal or a recorded and replayed voice signal.
  • the original voice signal may be a voice signal generated by the user's direct utterance (that is, a voice signal that has not been recorded and reproduced through equipment such as audio and video recording).
  • the first feature acquisition module 601 is configured to: divide the speech signal to be processed into N speech frames, where N is an integer greater than or equal to 1;
  • the first feature acquisition module 601 may sample the to-be-processed voice signal at a preset sampling period, and convert the continuous to-be-processed voice signal into a discretized voice signal.
  • the sampling period may be determined according to the Nyquist sampling theorem.
  • α is the pre-emphasis coefficient, and α is greater than 0.9 and less than 1;
  • a window function can be used to frame the discrete speech signal to obtain multiple speech frames.
  • N speech frames are obtained.
  • the window function can be any of a rectangular window, a Hamming window or a Hanning window.
  • the first feature acquisition module 601 can also eliminate noise and interference in the voice frame through endpoint detection.
  • endpoint detection can be performed by means such as energy-based endpoint detection, information entropy-based endpoint detection, or frequency band variance-based endpoint detection.
  • for example, the first feature vector of each speech frame is a feature vector with a 400-dimensional feature space, that is, the first feature vector is used to represent the eigenvalue of the speech frame in each dimensional feature space of the 400-dimensional feature space; for 100 speech frames, 100 first feature vectors of 400 dimensions are obtained.
  • the first feature acquisition module 601 can perform linear prediction cepstral coefficients (LPCC) feature extraction, Mel-scale frequency cepstral coefficients (MFCC) feature extraction, or constant Q cepstral coefficients (CQCC) feature extraction on each of the N speech frames to obtain the first feature vector.
  • the first feature acquisition module 601 may first perform a Constant Q Transform (CQT) on the voice signal corresponding to each of the N voice frames to transform the time-domain signal into a frequency-domain signal; secondly, calculate the energy spectrum of each speech frame in the N speech frames, and take the logarithm of the energy spectrum to obtain the logarithmic energy spectrum; finally, uniformly resample the logarithmic energy spectrum to obtain a sampling function, and then perform a Discrete Cosine Transform (DCT) on the sampling function to obtain the CQCC feature vector, that is, the first feature vector. In this way, the first feature vector of each of the N speech frames can be obtained.
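  • The last two CQCC stages (logarithm of the energy spectrum, then DCT) can be sketched with the standard library; the Constant Q Transform itself is omitted here, and a toy spectrum stands in for its output:

```python
import math

def log_energy(spectrum, eps=1e-10):
    """Energy spectrum -> logarithmic energy spectrum (eps guards log(0))."""
    return [math.log(e * e + eps) for e in spectrum]

def dct_ii(x):
    """Discrete Cosine Transform (DCT-II) of the (resampled) log
    energy spectrum; its outputs are the cepstral coefficients."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
            for k in range(n)]

# Toy frame spectrum in place of a real Constant Q Transform output:
coeffs = dct_ii(log_energy([1.0, 2.0, 4.0, 2.0]))
print(len(coeffs))  # one cepstral coefficient per input bin
```

  • In practice only the first few DCT coefficients would be kept as the CQCC feature vector; how many to keep is a design choice the application does not fix.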
  • a statistical value corresponding to the dimensional feature space is calculated, where the statistical value is the statistical value of the N speech frames in the dimensional feature space.
  • the first feature acquisition module 601 calculates the statistical value of N speech frames in the feature space of each dimension in the feature space of M dimensions. For example, if M is 400 and N is 100, for each dimensional feature space in the 400-dimensional feature space, the statistical value of 100 speech frames in the dimensional feature space is calculated.
  • the statistical value may include the mean value and/or standard deviation, that is, the first feature acquisition module 601 calculates the mean value of N speech frames in the dimensional feature space for each dimensional feature space in the M-dimensional feature space to obtain the M-dimensional mean vector ;
  • the first feature acquisition module 601 calculates the standard deviation of the N speech frames in the feature space for each dimension in the M-dimensional feature space to obtain an M-dimensional standard deviation vector. For example, for each dimensional feature space in a 400-dimensional feature space, calculate the mean value and/or standard deviation of 100 speech frames in the dimensional feature space to obtain a 400-dimensional mean vector and/or 400-dimensional standard deviation vector.
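  • The mean and standard deviation vectors described above can be sketched directly; the population standard deviation is used here, though the application does not specify sample versus population:

```python
import statistics

def statistical_feature_vector(frames):
    """Given N first feature vectors (N frames x M dimensions), compute
    the per-dimension mean vector and standard deviation vector."""
    dims = list(zip(*frames))  # M tuples, each holding N per-frame values
    mean_vec = [statistics.mean(d) for d in dims]
    std_vec = [statistics.pstdev(d) for d in dims]
    return mean_vec, std_vec

# 3 frames in a 2-dimensional feature space:
mean_vec, std_vec = statistical_feature_vector([[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]])
print(mean_vec)  # [2.0, 4.0]
```

  • With M = 400 and N = 100, the same function returns a 400-dimensional mean vector and a 400-dimensional standard deviation vector, matching the example in the text.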
  • the first statistical feature vector includes a first mean vector and/or a first standard deviation vector.
  • the first mean vector is used to represent the mean value of each dimension of the feature space in the M-dimensional feature space of the speech signal to be processed
  • the first standard deviation vector It is used to represent the standard deviation of each dimensional feature space in the M-dimensional feature space of the speech signal to be processed.
  • the first feature acquisition module 601 constructs the first statistical feature vector corresponding to the voice signal to be processed according to the statistical value corresponding to each dimension of the feature space in the M-dimensional feature space, namely: the first feature The acquisition module 601 constructs the first mean vector corresponding to the voice signal to be processed according to the mean value corresponding to each dimensional feature space in the M-dimensional feature space, and constructs the voice signal to be processed according to the standard deviation corresponding to each dimensional feature space in the M-dimensional feature space The corresponding first standard deviation vector.
  • the first mean vector is a vector with M-dimensional feature space composed of M mean values
  • the first standard deviation vector is a vector with M-dimensional feature space composed of M standard deviations.
  • the first statistical feature vector includes a first mean vector and/or a first standard deviation vector
  • the first mean vector is used to represent the mean value of each dimensional feature space of the speech signal to be processed in the M-dimensional feature space, and the first standard deviation vector is used to represent the standard deviation of each dimensional feature space of the speech signal to be processed in the M-dimensional feature space.
  • the first statistical feature vector includes the first mean vector and the first standard deviation vector
  • the second statistical feature vector includes a second mean vector and a second standard deviation vector
  • the second mean vector is obtained according to the first mean vector and the first model
  • the second standard deviation vector is obtained according to the first standard deviation vector and the first model.
  • the first statistical feature vector includes the first mean vector and the first standard deviation vector
  • the second statistical feature vector includes the second mean vector and the second standard deviation vector
  • the second mean vector is obtained according to the first mean vector and the first model
  • the second standard deviation vector is obtained according to the first standard deviation vector and the first model.
  • the second mean vector may be the product of the first mean vector and the target weight matrix
  • the second standard deviation vector may be the product of the first standard deviation vector and the target weight matrix.
  • the target category determining module 603 is further configured to construct a third statistical feature vector according to the second mean vector and the second standard deviation vector.
  • the target category determination module 603 can obtain the third statistical feature vector by concatenating the second mean vector and the second standard deviation vector. Since the second mean vector is a vector with an M-dimensional feature space and the second standard deviation vector is also a vector with an M-dimensional feature space, the third statistical feature vector obtained by splicing is a vector with a 2M-dimensional feature space, that is, the third statistical feature vector is a 2M-dimensional feature vector.
  • the target category determining module 603 is further configured to determine the target category of the voice signal to be processed according to the third statistical feature vector.
  • the target category determination module 603 may also perform dimensionality reduction processing on the third statistical feature vector through the dimensionality reduction module to obtain a two-dimensional feature vector, thereby determining the target category of the voice signal to be processed according to the two-dimensional feature vector.
  • the target category determination module 603 can preset the corresponding relationship between two-dimensional feature vectors and voice signal categories, and can then perform dimensionality reduction processing on the third statistical feature vector to obtain the two-dimensional feature vector.
  • according to the obtained two-dimensional feature vector and the preset corresponding relationship between two-dimensional feature vectors and voice signal categories, the voice signal category corresponding to the two-dimensional feature vector is determined, thereby determining the target category of the voice signal to be processed.
  • the device 60 further includes: a first model training module 604, configured to obtain a first sample statistical feature vector corresponding to the first sample speech signal, the first sample statistical feature vector It is used to represent the statistical value of each dimension of the feature space of the first sample voice signal in the M-dimensional feature space, where M is an integer greater than 1, and the first sample voice signal is a recording and replaying voice signal or an original voice signal.
  • the first sample speech signal is a speech signal prepared for training the first model.
  • the first sample voice signal may be obtained by recording the original voice signal, or may be obtained by recording the recording and replaying voice signal.
  • the first model training module 604 may determine the target category of the first sample voice signal when acquiring the first sample voice signal; that is, before inputting the first sample voice signal into the first model for processing, it is predetermined whether the first sample voice signal belongs to the original voice signal or the recorded and replayed voice signal.
  • the target categories of first sample voice signal 1, first sample voice signal 2, and first sample voice signal 3 may be recorded in advance. For example, if the target categories of first sample voice signal 1, first sample voice signal 2, and first sample voice signal 3 are original voice signal, original voice signal, and recording and replaying voice signal respectively, the correspondences first sample voice signal 1 → original voice signal, first sample voice signal 2 → original voice signal, and first sample voice signal 3 → recording and replaying voice signal can be recorded.
  • the first model training module 604 can obtain the first sample statistical feature vector corresponding to the first sample speech signal; reference may be made to the method of obtaining the first statistical feature vector corresponding to the speech signal to be processed in step S101, which will not be repeated here.
  • the first sample statistical feature vector includes the first sample mean vector and/or the first sample standard deviation vector; the first sample mean vector is used to represent the mean value of each dimensional feature space of the first sample speech signal in the M-dimensional feature space
  • the first sample standard deviation vector is used to represent the standard deviation of each dimensional feature space of the first sample speech signal in the M-dimensional feature space.
  • the first model training module 604 is further configured to input the first sample statistical feature vector into the first model for processing to obtain a second sample statistical feature vector.
  • the first sample statistical feature vector includes the first sample mean vector and the first sample standard deviation vector
  • the second sample statistical feature vector includes the second sample mean vector and the second sample standard deviation vector
  • the second sample mean vector The vector is obtained based on the first sample mean vector and the first model
  • the second sample standard deviation vector is obtained based on the first sample standard deviation vector and the first model.
  • the first model training module 604 obtains the first sample statistical feature vector corresponding to the first sample speech signal, inputs the first sample statistical feature vector into the first model, and passes the weight module in the first model Perform weight calculation on the statistical feature vector of the first sample to obtain the statistical feature vector of the second sample.
  • the third sample statistical feature vector can also be obtained according to the second sample statistical feature vector, and the third sample statistical feature vector is reduced by the dimensionality reduction module to obtain a two-dimensional sample feature vector, the two-dimensional sample feature vector Corresponds to a target category.
  • the weight module includes a target weight matrix.
  • the target weight matrix is used to express the importance of each dimensional feature space in the M-dimensional feature space; the first sample statistical feature vector is weighted by the target weight matrix to obtain the second sample statistical feature vector, and the third sample statistical feature vector is obtained from the second sample statistical feature vector;
  • the dimensionality reduction module may include a fully connected layer to reduce the amount of calculation in the training of the first model. For example, the obtained third sample statistical feature vector is a high-dimensional (2M-dimensional) feature vector
  • a 2-dimensional low-dimensional feature vector can be obtained through the dimensionality reduction module, which reduces the amount of calculation in model training.
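  • The fully connected dimensionality-reduction layer can be sketched as a plain matrix-vector product mapping a 2M-dimensional vector to 2 dimensions; the weight and bias values below are illustrative placeholders, not trained parameters:

```python
def fully_connected(vec, weights, bias):
    """A single fully connected layer: out[j] = sum_i vec[i] * weights[i][j]
    + bias[j], reducing a 2M-dimensional vector to len(bias) dimensions."""
    out = list(bias)
    for i, v in enumerate(vec):
        for j in range(len(bias)):
            out[j] += v * weights[i][j]
    return out

# Reduce a 4-dimensional (2M, with M = 2) vector to a 2-dimensional one:
vec2d = fully_connected([1.0, 2.0, 3.0, 4.0],
                        weights=[[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]],
                        bias=[0.0, 0.0])
print(len(vec2d))  # 2
```

  • During training, gradient descent would adjust `weights` and `bias`; the sketch only shows the forward computation.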
  • the first model training module 604 is further configured to calculate the first loss of the first model according to the statistical feature vector of the second sample.
  • the first model training module 604 calculates the first loss of the first model according to the second sample statistical feature vector, that is, the first model training module 604 calculates the first loss of the first model according to the second sample mean vector and the second sample standard deviation vector.
  • the first model training module 604 predetermines the target category of the first sample speech signal, and uses the first model to process the first sample statistical feature vector corresponding to the first sample speech signal to obtain the second The sample statistical feature vector, and the third sample statistical feature vector is obtained according to the second sample statistical feature vector, and the third sample statistical feature vector is reduced by the dimensionality reduction module, and the obtained two-dimensional sample feature vector corresponds to a target category.
  • the first loss of the first model is calculated according to the predetermined similarity between the target category of the first sample speech signal and the target category corresponding to the two-dimensional sample feature vector.
  • the higher the similarity, the smaller the first loss; the lower the similarity, the greater the first loss, where the first loss may be a cross-entropy loss.
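  • Treating the two-dimensional sample feature vector as class logits is one way to realize this (an assumption; the application only says the first loss may be a cross-entropy loss). A stdlib sketch with categories 0 = original and 1 = recording-and-replay:

```python
import math

def softmax(logits):
    """Convert logits to probabilities (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def first_loss(two_dim_vec, target_index):
    """Cross-entropy between the 2-dimensional sample feature vector,
    treated as class logits, and the predetermined target category."""
    probs = softmax(two_dim_vec)
    return -math.log(probs[target_index])

# A vector that points strongly at the correct category gives a small loss:
print(first_loss([5.0, -5.0], 0) < first_loss([5.0, -5.0], 1))  # True
```

  • Minimizing this loss by gradient descent is exactly the adjustment of the weight module and dimensionality reduction module described in the surrounding text.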
  • the first model training module 604 is further configured to train the first model according to the first loss.
  • the first model training module 604 may use the gradient descent method to adjust the first model, that is, adjust the weight module in the first model, and the first model training module 604 may also The gradient descent method is used to adjust the dimensionality reduction module to make the parameters of the model training more accurate, so that the second statistical feature vector obtained by the first model processing can more accurately reflect the category of the first sample speech signal.
  • the device 60 further includes: a voice signal acquisition module 605, configured to acquire a first voice signal, and the first voice signal is a recording and replaying voice signal.
  • the voice signal acquisition module 605 may record the original voice signal through a recording device or the like to obtain a recorded and replayed voice signal.
  • the speech signal acquisition module 605 is further configured to acquire a second feature vector of the first speech signal, and input the second feature vector into an encoding model for encoding processing to obtain a fourth statistical feature vector.
  • the statistical feature vector is used to represent the statistical feature of the first speech signal.
  • the second feature vector may include an LPCC feature vector, an MFCC feature vector, or a CQCC feature vector
  • the speech signal acquisition module 605 can obtain the second feature by performing LPCC feature extraction, MFCC feature extraction, or CQCC feature extraction on the first voice signal vector.
  • the voice signal acquisition module 605 obtains the second feature vector by extracting the features of the first voice signal, and inputs the second feature vector into the coding model for coding processing to obtain the fourth statistical feature vector
  • the fourth statistical feature vector includes a third mean vector and a third standard deviation vector.
  • the third mean vector is used to represent the mean value of each dimension of the first voice signal in the M-dimensional feature space
  • the third standard deviation vector is used to represent the standard deviation of each dimensional feature space of the first voice signal in the M-dimensional feature space.
  • the speech signal acquisition module 605 is further configured to construct a first implicit vector according to the fourth statistical feature vector, and input the first implicit vector into a decoding model for decoding processing to obtain a third feature vector, where: The similarity between the second voice signal generated by the third feature vector and the first voice signal satisfies a target condition.
  • the target condition is that the similarity between the second speech signal and the first speech signal satisfies the similarity threshold.
  • the similarity threshold can be 80%, 90%, 95%, etc., that is, the first speech signal and the second speech signal are two speech signals with high similarity. In this way, a recording and replaying voice signal can be used to generate another recording and replaying voice signal with high similarity to it.
  • if there are X recording and replaying voice signals, 2X recording and replaying voice signals can be generated in the above manner, and the 2X recording and replaying voice signals are further used to train the first model.
  • the speech signal acquisition module 605 constructs the first implicit vector according to the fourth statistical feature vector, that is, constructs the first implicit vector according to the third mean vector and the third standard deviation vector, and inputs the first implicit vector into the decoding model for decoding Process to obtain the third feature vector.
  • The encoding model and the decoding model may be the encoding layer and the decoding layer of a Variational Autoencoder (VAE). They may be trained in advance so that the trained coding model and decoding model are more accurate, making the obtained second speech signal more similar to the first speech signal corresponding to the second feature vector input into the coding model. Refer to Figure 5 for the training method of the coding model and the decoding model.
  • The voice signal acquisition module 605 is further configured such that, if the first sample voice signal used to train the first model is a recorded-and-replayed voice signal, the first sample voice signal may be the first voice signal or the second voice signal.
  • the device 60 further includes:
  • the second model training module 606 is configured to obtain the first sample feature vector corresponding to the second sample speech signal.
  • the method for the second model training module 606 to obtain the first sample feature vector corresponding to the second sample speech signal can refer to the method for obtaining the second feature vector of the first speech signal in step S302, which will not be repeated here.
  • The second model training module 606 is further configured to input the first sample feature vector into the coding model for coding processing to obtain a third sample statistical feature vector, which represents the statistical characteristics of the second sample speech signal. The third sample statistical feature vector includes a second sample mean vector and a second sample standard deviation vector: the second sample mean vector represents the mean of the second sample speech signal in each dimension of the M-dimensional feature space, and the second sample standard deviation vector represents the standard deviation of the second sample speech signal in each dimension of the M-dimensional feature space. That is, the second model training module 606 inputs the first sample feature vector into the coding model for coding processing and obtains the second sample mean vector and the second sample standard deviation vector.
  • The second model training module 606 is further configured to determine the second loss according to the third sample statistical feature vector and the standard normal distribution function.
  • The second model training module 606 may first determine a first normal distribution function according to the second sample mean vector and the second sample standard deviation vector, and then determine the second loss according to the degree of coincidence between the first normal distribution function and the standard normal distribution function: the higher the coincidence, the smaller the second loss; the lower the coincidence, the greater the second loss. The second loss may be a divergence loss. Here, the degree of coincidence between the two functions is the degree of overlap between the graph of the first normal distribution function and the graph of the standard normal distribution function on the same coordinate axes.
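The divergence loss between the predicted distribution and the standard normal is commonly implemented as a KL divergence. A minimal numpy sketch, assuming diagonal Gaussians in closed form (the exact formula is not given in the text and is an assumption here):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL divergence between N(mu, diag(sigma^2)) and N(0, I).

    The closer the predicted distribution is to the standard normal,
    the smaller this "second loss" becomes, matching the coincidence
    criterion described above."""
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

# A distribution identical to N(0, I) has zero divergence.
mu = np.zeros(4)
sigma = np.ones(4)
loss = kl_to_standard_normal(mu, sigma)
```

A shifted mean or a non-unit standard deviation strictly increases this loss, which is the behavior the coincidence rule above describes.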
  • the second model training module 606 is further configured to construct a first sample implicit vector according to the third sample statistical feature vector, and input the first sample implicit vector into the decoding model for decoding processing, Obtain the second sample feature vector.
  • The second model training module 606 may multiply the second sample standard deviation vector by a value drawn from the standard normal distribution and add the second sample mean vector to obtain the first sample implicit vector; that is, the first sample implicit vector is the product of the standard-normal sample and the second sample standard deviation vector, plus the second sample mean vector. The second model training module 606 then inputs the first sample implicit vector into the decoding model for decoding processing to obtain the second sample feature vector.
  • the second sample feature vector is a feature vector constructed after inputting the first sample feature vector into the encoding model and the decoding model.
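The construction above is the usual VAE reparameterization step. A minimal numpy sketch, where the noise draw and the vector dimensionality are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_implicit_vector(mean_vec, std_vec):
    """Sample-implicit vector: standard-normal noise scaled by the
    standard-deviation vector, plus the mean vector (the usual
    VAE reparameterization trick)."""
    eps = rng.standard_normal(mean_vec.shape)  # draw from N(0, I)
    return mean_vec + std_vec * eps

mean_vec = np.zeros(8)
std_vec = np.full(8, 0.5)
z = build_implicit_vector(mean_vec, std_vec)  # fed to the decoding model
```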
  • the second model training module 606 is further configured to determine a third loss according to the first sample feature vector and the second sample feature vector.
  • The second model training module 606 may determine the third loss according to the similarity between the first sample feature vector and the second sample feature vector: the higher the similarity, the smaller the third loss; the lower the similarity, the greater the third loss. The third loss may be a cross-entropy loss.
  • the second model training module 606 is further configured to train the encoding model and the decoding model according to the second loss and the third loss.
  • The second model training module 606 may use gradient descent to adjust the parameters in the encoding model according to the second loss, and gradient descent to adjust the parameters in the decoding model according to the third loss, making the adjusted encoding and decoding models more accurate, so that the second sample feature vector obtained through the encoding and decoding models is more similar to the first sample feature vector.
  • The first voice signal and the second voice signal are both recorded-and-replayed voice signals: the first voice signal is the voice signal corresponding to the first sample feature vector input into the coding model, and the second voice signal is the voice signal corresponding to the second sample feature vector output from the decoding model. Optionally, the recorded-and-replayed voice signals may include a transcribed voice signal, i.e. a voice signal obtained by transcribing a recorded-and-replayed voice signal through the encoding model and the decoding model, namely the second voice signal. When the statistical feature vector corresponding to a transcribed voice signal is input into the first model for processing, the target category obtained for the transcribed voice signal is the recorded-and-replayed voice signal.
  • In the embodiments of this application, the first statistical feature vector corresponding to the speech signal to be processed is obtained, input into the first model for processing to obtain the second statistical feature vector, and the target category of the speech signal is determined from the second statistical feature vector, so as to determine whether the speech signal to be processed is an original speech signal or a recorded-and-replayed speech signal. Because the first statistical feature vector is processed according to the importance of each dimension of the M-dimensional feature space, the statistical features of each dimension are strengthened and can more accurately reflect the statistical features of the speech signal to be processed, so its target category is determined accurately, improving the accuracy of replay detection; the content of the speech signal does not need to be examined, which improves detection efficiency and gives strong applicability. Because a large number of sample speech signals are used to train the first model, the trained model is more accurate, making replay detection results more accurate. By using the encoding and decoding models to process the statistical feature vectors corresponding to sample speech signals, a large number of sample speech signals can be obtained quickly; compared with recording sample speech signals through a recording device, this method obtains large numbers of sample speech signals more efficiently.
  • FIG. 7 is a schematic diagram of the composition structure of a voice signal processing device provided by an embodiment of the present application.
  • the device 70 includes a processor 701, a memory 702, and an input and output interface 703.
  • the processor 701 is connected to the memory 702 and the input/output interface 703.
  • the processor 701 may be connected to the memory 702 and the input/output interface 703 through a bus.
  • the processor 701 is configured to support the voice signal processing device to perform corresponding functions in the voice signal processing methods described in FIG. 1 to FIG. 2 and FIG. 4.
  • the processor 701 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof.
  • the aforementioned hardware chip may be an application specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the memory 702 is used to store program codes and the like.
  • The memory 702 may include volatile memory (VM), such as random access memory (RAM); the memory 702 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 702 may also include a combination of the foregoing types of memory.
  • the input and output interface 703 is used to input or output data.
  • The processor 701 may call the program code to perform the following operations: obtain a first statistical feature vector corresponding to the voice signal to be processed, where the first statistical feature vector represents the statistical value of the voice signal to be processed in each dimension of the M-dimensional feature space, and M is an integer greater than 1; input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, where the first model processes the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space; and determine, according to the second statistical feature vector, the target category of the voice signal to be processed, the target category including the original voice signal or the recorded-and-replayed voice signal.
  • For details of each operation, refer to the corresponding descriptions in the foregoing method embodiments; the processor 701 may also cooperate with the input and output interface 703 to perform other operations in the foregoing method embodiments.
  • An embodiment of the present application also provides a computer storage medium storing a computer program; the computer program includes program instructions which, when executed by a computer, cause the computer to execute the voice signal processing method described in the foregoing embodiments. The computer may be a part of the voice signal processing device mentioned above.
  • the computer-readable storage medium may be non-volatile or volatile.
  • All or part of the processes of the above method embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the procedures of the above method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.


Abstract

A speech signal processing method, apparatus, and device. The method includes: obtaining a first statistical feature vector corresponding to a speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1 (S101); inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model being used to process the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space (S102); and determining, according to the second statistical feature vector, a target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal (S103). The method can improve the accuracy of recorded-and-replayed signal detection.

Description

Speech signal processing method, apparatus, and device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 17, 2020, with application number 202010096100.4 and title "Speech signal processing method, apparatus, and device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of signal processing, and in particular to a speech signal processing method, apparatus, and device.
Background
In recent years, voiceprint recognition technology has been widely applied in remote unsupervised identity authentication. However, its use also involves many security risks. For example, recording a speaker's voice and then replaying the recording is the most common attack faced by voiceprint recognition systems. A replay attack records the target person's voice with high-fidelity recording equipment and then uses the recorded speech signal to break the voiceprint authentication system. Because replayed speech comes from the speaker himself, it is more authentic, and this kind of attack poses a greater threat to system security.
At present, to prevent replay attacks, when a user performs voiceprint verification the system specifies a text passage the user must read aloud, and speech content recognition is used alongside voiceprint verification for replay detection. The inventors found that when a user has a heavy accent or particular pronunciation habits, the accuracy of speech content recognition drops sharply, reducing the accuracy of replayed-speech detection.
Summary
In view of the problem that speech content recognition accuracy drops sharply when a user has a heavy accent or particular pronunciation habits, reducing the accuracy of replayed-speech detection, the embodiments of this application provide a speech signal processing method, apparatus, device, and computer storage medium that can improve the accuracy of replayed-signal detection and, because the content of the speech signal need not be examined, improve detection efficiency.
In a first aspect, an embodiment of this application provides a speech signal processing method, including: obtaining a first statistical feature vector corresponding to a speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1; inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model being used to process the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space; and determining, according to the second statistical feature vector, a target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal.
In a second aspect, an embodiment of this application provides a speech signal processing apparatus, including: a first feature acquisition module configured to obtain a first statistical feature vector corresponding to a speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1; a second feature acquisition module configured to input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model being used to process the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space; and a target category determination module configured to determine, according to the second statistical feature vector, a target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal.
In a third aspect, an embodiment of this application provides a speech signal processing device, including a processor, a memory, and an input/output interface connected to one another, where the input/output interface is used to input or output data, the memory is used to store the application program code with which the speech signal processing device executes the above method, and the processor is configured to perform the following steps: obtaining a first statistical feature vector corresponding to a speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1; inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model being used to process the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space; and determining, according to the second statistical feature vector, a target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal.
In a fourth aspect, an embodiment of this application provides a computer storage medium storing a computer program; the computer program includes program instructions which, when executed by a processor, cause the processor to perform the following steps: obtaining a first statistical feature vector corresponding to a speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1; inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model being used to process the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space; and determining, according to the second statistical feature vector, a target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal.
In the embodiments of this application, the first statistical feature vector corresponding to the speech signal to be processed is obtained, input into the first model for processing to obtain the second statistical feature vector, and the target category of the speech signal is determined from the second statistical feature vector, thereby determining whether the speech signal is an original or a recorded-and-replayed speech signal. Because the first statistical feature vector is processed according to the importance of each dimension of the M-dimensional feature space, the statistical features of each dimension are strengthened and reflect the statistical features of the speech signal more accurately, so its target category is determined accurately, improving the accuracy of replay detection without examining the content of the speech signal, which improves detection efficiency and gives strong applicability.
Brief Description of the Drawings
Figure 1 is a schematic flowchart of a speech signal processing method provided by an embodiment of this application.
Figure 2 is a schematic flowchart of another speech signal processing method provided by an embodiment of this application.
Figure 3 is a schematic diagram of training a first model provided by an embodiment of this application.
Figure 4 is a schematic flowchart of another speech signal processing method provided by an embodiment of this application.
Figure 5 is a schematic diagram of training an encoding model and a decoding model provided by an embodiment of this application.
Figure 6 is a schematic diagram of the composition of a speech signal processing apparatus provided by an embodiment of this application.
Figure 7 is a schematic diagram of the composition of a speech signal processing device provided by an embodiment of this application.
Detailed Description
The solutions of the embodiments of this application apply to scenarios in which a speech signal is processed to determine whether its target category is the recorded-and-replayed category. A first statistical feature vector corresponding to the speech signal to be processed is obtained, input into a first model for processing to obtain a second statistical feature vector, and the target category of the speech signal is determined from the second statistical feature vector, thereby determining whether the speech signal is an original speech signal or a recorded-and-replayed speech signal. Because the first statistical feature vector is processed according to the importance of each dimension of the M-dimensional feature space, the statistical features of each dimension are strengthened and can more accurately reflect the statistical features of the speech signal to be processed, so that its target category is determined accurately, improving the accuracy of replay detection; the content of the speech signal does not need to be examined, which improves detection efficiency and gives the method strong applicability.
Referring to Figure 1, a schematic flowchart of a speech signal processing method provided by an embodiment of this application, the method includes S101-S103.
S101: Obtain a first statistical feature vector corresponding to the speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1.
The embodiments of this application may be applied in a voiceprint recognition and authentication system, in which the identity of a user is determined by detecting the voiceprint of the speech signal to be processed. Since an attacker might use a recorded-and-replayed speech signal of some user for voiceprint recognition, the embodiments of this application perform replay detection on the speech signal to be processed.
The speech signal to be processed may be an original speech signal or a recorded-and-replayed speech signal. The original speech signal is a speech signal produced directly by a user speaking (that is, not recorded and replayed through recording or video equipment); the recorded-and-replayed speech signal may include a recording of a speech signal produced directly by a user, a speech signal synthesized by signal synthesis rather than produced directly by a user, and so on. In this technical solution, all speech signals other than original speech signals are called recorded-and-replayed speech signals.
The first statistical feature vector includes a first mean vector and/or a first standard deviation vector; the first mean vector represents the mean of the speech signal to be processed in each dimension of the M-dimensional feature space, and the first standard deviation vector represents the corresponding standard deviation in each dimension of the M-dimensional feature space.
In one implementation, obtaining the first statistical feature vector corresponding to the speech signal to be processed may include the following steps one to four.
1. Divide the speech signal to be processed into N speech frames, N being an integer greater than or equal to 1.
Specifically, the speech signal to be processed may be sampled at a preset sampling period, converting the continuous signal into a discrete signal; the sampling period may be determined according to the Nyquist sampling theorem. The discrete signal is then filtered by a digital filter with transfer function H(Z)=1-αZ^-1 to increase its high-frequency resolution, where α is a pre-emphasis coefficient greater than 0.9 and less than 1. Finally, the discrete signal may be divided into frames with a window function to obtain the N speech frames, where the window function may be a rectangular, Hamming, or Hanning window.
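The pre-emphasis filter H(Z)=1-αZ^-1 and windowed framing described above can be sketched in numpy as follows; the frame length, hop size, and α=0.97 are illustrative assumptions, and a Hamming window stands in for the window function:

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """Digital filter H(z) = 1 - alpha * z^-1 boosting high frequencies."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split the (pre-emphasized) signal into overlapping Hamming-windowed frames."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop: i * hop + frame_len] * window
                     for i in range(n)])

x = preemphasize(np.random.default_rng(1).standard_normal(16000))
frames = frame_signal(x)  # N speech frames of length 400
```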
Optionally, noise and interference in the speech frames may also be removed through endpoint detection, which may be based on energy, information entropy, or frequency-band variance, among other methods.
2. Obtain the first feature vector of each of the N speech frames, the first feature vector representing the feature value of the speech frame in each dimension of the M-dimensional feature space.
For example, if M is 400 and N is 100, 100 speech frames are obtained, and the first feature vector of each frame is a feature vector with a 400-dimensional feature space; that is, 100 400-dimensional first feature vectors are obtained.
Specifically, linear prediction cepstral coefficients (LPCC), Mel-scale frequency cepstral coefficients (MFCC), or constant Q cepstral coefficients (CQCC) may be extracted from each of the N speech frames to obtain the first feature vector.
Taking CQCC extraction as an example: first, apply a constant Q transform (CQT) to the speech signal of each of the N frames to convert the time-domain signal into a frequency-domain signal; next, compute the energy spectrum of each frame and take its logarithm to obtain the log energy spectrum; finally, uniformly resample the log energy spectrum to obtain a sampled function and apply a discrete cosine transform (DCT) to it to obtain the CQCC feature vector, i.e., the first feature vector of each of the N frames.
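The pipeline above (transform, energy spectrum, logarithm, uniform resampling, DCT) can be sketched as follows. This is a simplified stand-in, not the patent's exact CQCC: an ordinary FFT replaces the constant-Q transform, and the resampling grid size and coefficient count are assumptions:

```python
import numpy as np

def dct_ii(x):
    """Unnormalized type-II DCT, sufficient for this cepstral sketch."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return (np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * x).sum(axis=1)

def cepstral_features(frame, n_coeffs=20):
    """Simplified CQCC-style pipeline for one frame: spectrum -> energy
    spectrum -> log -> uniform resampling -> DCT."""
    energy = np.abs(np.fft.rfft(frame)) ** 2          # energy spectrum
    log_energy = np.log(energy + 1e-10)               # log energy spectrum
    grid = np.linspace(0, len(log_energy) - 1, 128)   # uniform resampling
    resampled = np.interp(grid, np.arange(len(log_energy)), log_energy)
    return dct_ii(resampled)[:n_coeffs]               # cepstral coefficients

feat = cepstral_features(np.random.default_rng(2).standard_normal(400))
```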
3. For each dimension of the M-dimensional feature space, compute the statistical value corresponding to that dimension, namely the statistical value of the N speech frames in that dimension.
For example, if M is 400 and N is 100, then for each of the 400 dimensions the statistical value of the 100 speech frames in that dimension is computed.
The statistical value may include a mean and/or a standard deviation: for each dimension of the M-dimensional feature space, computing the mean of the N frames in that dimension gives an M-dimensional mean vector, and computing the standard deviation gives an M-dimensional standard deviation vector. For example, for each of 400 dimensions, compute the mean and/or standard deviation of 100 frames to obtain a 400-dimensional mean vector and/or a 400-dimensional standard deviation vector.
4. Construct the first statistical feature vector corresponding to the speech signal to be processed from the statistical value of each dimension of the M-dimensional feature space.
Here, the first statistical feature vector includes the first mean vector and/or the first standard deviation vector, representing respectively the mean and the standard deviation of the speech signal to be processed in each dimension of the M-dimensional feature space.
When the statistical values include both mean and standard deviation, constructing the first statistical feature vector means constructing the first mean vector from the per-dimension means and the first standard deviation vector from the per-dimension standard deviations. The first mean vector is thus a vector of M means with an M-dimensional feature space, and the first standard deviation vector a vector of M standard deviations with an M-dimensional feature space.
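The per-dimension mean and standard deviation over the N frame-level feature vectors can be sketched as follows (the N=100, M=400 example follows the text):

```python
import numpy as np

def statistical_feature_vectors(frame_features):
    """frame_features: (N, M) matrix, one M-dimensional feature vector per
    frame. Returns the M-dimensional mean and standard-deviation vectors
    over the N frames, i.e. the first statistical feature vector parts."""
    mean_vec = frame_features.mean(axis=0)
    std_vec = frame_features.std(axis=0)
    return mean_vec, std_vec

feats = np.random.default_rng(3).standard_normal((100, 400))  # N=100, M=400
mu, sd = statistical_feature_vectors(feats)
```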
S102: Input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model processing the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space.
Specifically, the first statistical feature vector may be processed through a weight module in the first model to obtain the second statistical feature vector. The weight module includes a target weight matrix, which may be a matrix with an M-dimensional feature space, where the value corresponding to each dimension represents the importance of that dimension: the larger the value, the more important the dimension; the smaller the value, the less important the dimension.
Optionally, the target weight matrix may be obtained by assigning a weight to each dimension according to a target rule based on the magnitude of the mean of each dimension of the first mean vector; that is, the target weight matrix is a matrix with an M-dimensional feature space. The target rule may be that a dimension with a larger mean in the first mean vector receives a larger weight, and a dimension with a smaller mean receives a smaller weight.
It should be noted that the first model may be trained in advance so that the second statistical feature vector obtained with the trained first model represents the category of the speech signal to be processed more accurately; for the training process, refer to the description of the embodiment corresponding to Figure 3, which is not elaborated here.
S103: Determine, according to the second statistical feature vector, the target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal.
Here, the original speech signal is a speech signal produced directly by a user speaking (that is, not recorded and replayed through recording or video equipment); the recorded-and-replayed speech signal may include a recording of such a signal, a signal obtained by signal synthesis rather than produced directly by a user, and so on.
Specifically, if the first statistical feature vector includes the first mean vector and the first standard deviation vector, the second statistical feature vector includes a second mean vector and a second standard deviation vector, obtained respectively from the first mean vector and the first model and from the first standard deviation vector and the first model. In a specific implementation, the second mean vector may be the product of the first mean vector and the target weight matrix, and the second standard deviation vector may be the product of the first standard deviation vector and the target weight matrix.
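Taking the second vectors as the product of the first vectors and the target weight matrix can be sketched as follows; the mean-proportional diagonal weight matrix is a hypothetical instance of the target rule, not the patent's actual trained matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 400
mu = rng.standard_normal(M)   # first mean vector
sd = rng.random(M) + 0.1      # first standard-deviation vector

# Hypothetical target weight matrix: dimensions with a larger mean get a
# larger weight, as the target rule suggests; a diagonal matrix for brevity.
weights = np.abs(mu) / np.abs(mu).sum()
W = np.diag(weights)

mu2 = W @ mu  # second mean vector
sd2 = W @ sd  # second standard-deviation vector
```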
In one possible implementation, a third statistical feature vector may first be constructed from the second mean vector and the second standard deviation vector, and the target category of the speech signal to be processed then determined from the third statistical feature vector.
In a specific implementation, the third statistical feature vector may be obtained by concatenating the second mean vector and the second standard deviation vector; since both are vectors with M-dimensional feature spaces, the concatenated third statistical feature vector may be a vector with a 2M-dimensional feature space, i.e., a 2M-dimensional feature vector.
Optionally, the third statistical feature vector may also be reduced through a dimension reduction module to a two-dimensional feature vector, from which the target category of the speech signal to be processed is determined. In a specific implementation, a correspondence between two-dimensional feature vectors and speech signal categories may be set in advance; when the two-dimensional feature vector is obtained, the corresponding speech signal category, and thus the target category of the speech signal to be processed, is determined from that correspondence.
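The concatenation into a 2M-dimensional third vector and its reduction to a two-dimensional class score can be sketched as follows; the random fully connected layer and the category mapping are illustrative assumptions standing in for the trained dimension reduction module:

```python
import numpy as np

rng = np.random.default_rng(5)
mu2 = rng.standard_normal(400)
sd2 = rng.standard_normal(400)

third = np.concatenate([mu2, sd2])  # 2M-dimensional third statistical vector

# Hypothetical fully connected layer reducing 2M dims to 2 class scores.
W_fc = rng.standard_normal((2, 800)) * 0.01
b_fc = np.zeros(2)
scores = W_fc @ third + b_fc
label = ["original", "replayed"][int(np.argmax(scores))]
```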
In the embodiments of this application, the first statistical feature vector corresponding to the speech signal to be processed is obtained, input into the first model for processing to obtain the second statistical feature vector, and the target category of the speech signal is determined from the second statistical feature vector, thereby determining whether the speech signal is an original or a recorded-and-replayed speech signal. Because the first statistical feature vector is processed according to the importance of each dimension of the M-dimensional feature space, the statistical features of each dimension are strengthened and reflect the statistical features of the speech signal more accurately, so its target category is determined accurately, improving the accuracy of replay detection without examining the content of the speech signal, which improves detection efficiency and gives strong applicability.
In one possible implementation, to make the second statistical feature vector obtained through the first model represent the category of the speech signal to be processed more accurately, the first model may be trained with a large number of sample speech signals before use, and adjusted according to the loss value obtained during training. The training steps are shown in Figure 2, a schematic flowchart of another speech signal processing method provided by an embodiment of this application; the method includes S201-S204.
S201: Obtain a first sample statistical feature vector corresponding to a first sample speech signal, the vector representing the statistical value of the first sample speech signal in each dimension of the M-dimensional feature space, M being an integer greater than 1, the first sample speech signal being a recorded-and-replayed speech signal or an original speech signal.
Here, the first sample speech signal is a speech signal prepared for training the first model; for example, it may be obtained by recording an original speech signal, or by recording a recorded-and-replayed speech signal.
Optionally, when obtaining the first sample speech signal, its target category may be determined, i.e., whether it is an original or a recorded-and-replayed speech signal is determined before inputting it into the first model. For example, the target categories of first sample speech signals 1, 2, and 3 may be recorded in advance: if they are original, original, and recorded-and-replayed respectively, the pairs (signal 1, original), (signal 2, original), (signal 3, recorded-and-replayed), and so on, may be recorded.
In a specific implementation, for obtaining the first sample statistical feature vector, refer to the method of obtaining the first statistical feature vector in step S101, which is not repeated here. The first sample statistical feature vector includes a first sample mean vector and/or a first sample standard deviation vector, representing respectively the mean and the standard deviation of the first sample speech signal in each dimension of the M-dimensional feature space.
S202: Input the first sample statistical feature vector into the first model for processing to obtain a second sample statistical feature vector.
If the first sample statistical feature vector includes the first sample mean vector and the first sample standard deviation vector, the second sample statistical feature vector includes a second sample mean vector and a second sample standard deviation vector, obtained from the first sample mean vector and the first model and from the first sample standard deviation vector and the first model respectively.
The process is illustrated in Figure 3, a schematic diagram of training the first model provided by an embodiment of this application: the first sample statistical feature vector corresponding to the first sample speech signal is obtained and input into the first model, and the weight module of the first model performs weight calculation on it to obtain the second sample statistical feature vector. Optionally, a third sample statistical feature vector may be obtained from the second sample statistical feature vector, and the dimension reduction module reduces it to a two-dimensional sample feature vector corresponding to a target category.
The weight module includes the target weight matrix representing the importance of each dimension of the M-dimensional feature space; that is, the weight module performs weight calculation on the first sample statistical feature vector according to the importance of each dimension to obtain the third sample statistical feature vector. The dimension reduction module may include a fully connected layer and reduces the amount of computation in training the first model: for example, the obtained third sample statistical feature vector is a high-dimensional (say 2M-dimensional) feature matrix, and the dimension reduction module reduces it to a low-dimensional (e.g. two-dimensional) feature matrix, reducing the computation in model training.
S203: Calculate a first loss of the first model according to the second sample statistical feature vector.
Here, calculating the first loss according to the second sample statistical feature vector means calculating it according to the first sample mean vector and the first sample standard deviation vector.
In a specific implementation, because the target category of the first sample speech signal is determined in advance, and the two-dimensional sample feature vector obtained by processing the first sample statistical feature vector through the first model and the dimension reduction module corresponds to a target category, the first loss of the first model is calculated from the similarity between the pre-determined target category of the first sample speech signal and the target category corresponding to the two-dimensional sample feature vector: the higher the similarity, the smaller the first loss; the lower the similarity, the greater the first loss. The first loss may be a cross-entropy loss.
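A cross-entropy first loss between the two-dimensional output and the pre-determined target category can be sketched as follows; softmax cross-entropy is one common instantiation, since the text does not fix the exact form:

```python
import numpy as np

def cross_entropy(scores, target):
    """First loss: softmax cross-entropy between the two-dimensional output
    vector and the pre-determined target category
    (here 0 = original, 1 = replayed, an illustrative encoding)."""
    shifted = scores - scores.max()                 # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

loss = cross_entropy(np.array([2.0, -1.0]), target=0)
```

A confident, correct prediction gives a small loss; the same scores against the wrong target give a much larger one, matching the similarity rule described above.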
S204: Train the first model according to the first loss.
When the first loss is large, the first model may be adjusted by gradient descent, i.e., its weight module is adjusted; the dimension reduction module may also be adjusted by gradient descent, making the trained parameters and the parameters of the dimension reduction module more accurate, so that the second statistical feature vector obtained through the first model reflects the category of the first sample speech signal more accurately.
In the embodiments of this application, a large number of sample speech signals are used to train the first model; the first loss is determined from the similarity between the pre-determined target category of each sample speech signal and the target category obtained through the first model, the accuracy of the first model is judged from the first loss, and the model is adjusted when the first loss is large, so that the second statistical feature vector obtained with the trained first model represents the target category of the sample speech signal more accurately. Because a large number of sample speech signals are used, the trained first model is more accurate, making replay detection results more accurate.
In one possible implementation, to make the trained first model more accurate, a large number of sample speech signals are needed for training. Because obtaining recorded-and-replayed sample speech signals by recording is inefficient, a large number of recorded-and-replayed sample speech signals can be obtained quickly in the following way, shown in Figure 4, a schematic flowchart of another speech signal processing method provided by an embodiment of this application; the method includes S301-S303.
S301: Obtain a first speech signal, the first speech signal being a recorded-and-replayed speech signal.
S302: Obtain a second feature vector of the first speech signal and input it into an encoding model for encoding processing to obtain a fourth statistical feature vector representing the statistical features of the first speech signal.
Here, the second feature vector may include an LPCC, MFCC, or CQCC feature vector, obtained by LPCC, MFCC, or CQCC feature extraction on the first speech signal; for CQCC extraction, refer to the description in step S101, which is not repeated here.
In the embodiments of this application, the second feature vector is obtained by feature extraction on the first speech signal and input into the encoding model for encoding processing to obtain the fourth statistical feature vector, which includes a third mean vector and a third standard deviation vector representing respectively the mean and the standard deviation of the first speech signal in each dimension of the M-dimensional feature space.
S303: Construct a first implicit vector from the fourth statistical feature vector and input it into a decoding model for decoding processing to obtain a third feature vector, where the similarity between a second speech signal generated from the third feature vector and the first speech signal satisfies a target condition.
Here, the target condition is that the similarity between the second and first speech signals meets a similarity threshold, for example 80%, 90%, or 95%; that is, the two signals are highly similar. In this way, one recorded-and-replayed speech signal can be used to generate another highly similar one: if there are X recorded-and-replayed speech signals, 2X can be produced in this manner and used to train the first model.
Constructing the first implicit vector from the fourth statistical feature vector means constructing it from the third mean vector and the third standard deviation vector; the first implicit vector is then input into the decoding model for decoding processing to obtain the third feature vector.
Optionally, the encoding model and the decoding model may be the encoding layer and the decoding layer of a variational autoencoder (VAE). Before use, the encoding and decoding models may be trained in advance so that the trained models are more accurate and the obtained second speech signal is more similar to the first speech signal corresponding to the input second feature vector. The training method is shown in Figure 5, a schematic diagram of training an encoding model and a decoding model provided by an embodiment of this application.
As shown in the figure:
1. Obtain a first sample feature vector corresponding to a second sample speech signal. For the method, refer to the method of obtaining the second feature vector of the first speech signal in step S302, which is not repeated here.
2. Input the first sample feature vector into the encoding model for encoding processing to obtain a third sample statistical feature vector representing the statistical features of the second sample speech signal. The third sample statistical feature vector includes a second sample mean vector and a second sample standard deviation vector, representing respectively the mean and the standard deviation of the second sample speech signal in each dimension of the M-dimensional feature space; that is, the encoding step yields the second sample mean vector and the second sample standard deviation vector.
3. Determine a second loss according to the third sample statistical feature vector and the standard normal distribution function. A first normal distribution function may first be determined from the second sample mean vector and the second sample standard deviation vector, and the second loss then determined from the degree of coincidence between the first normal distribution function and the standard normal distribution function: the higher the coincidence, the smaller the second loss; the lower the coincidence, the greater the second loss. The second loss may be a divergence loss; the degree of coincidence is the overlap between the graph of the first normal distribution function and the graph of the standard normal distribution function on the same coordinate axes.
4. Construct a first sample implicit vector from the third sample statistical feature vector and input it into the decoding model for decoding processing to obtain a second sample feature vector. In a specific implementation, the second sample standard deviation vector may be multiplied by a value drawn from the standard normal distribution and the second sample mean vector added to obtain the first sample implicit vector; that is, the first sample implicit vector is the product of the standard-normal sample and the second sample standard deviation vector, plus the second sample mean vector. The first sample implicit vector is input into the decoding model for decoding processing to obtain the second sample feature vector, i.e., the feature vector constructed by passing the first sample feature vector through the encoding and decoding models.
5. Determine a third loss according to the first sample feature vector and the second sample feature vector, i.e., according to their similarity: the higher the similarity, the smaller the third loss; the lower the similarity, the greater the third loss. The third loss may be a cross-entropy loss.
6. Train the encoding model and the decoding model according to the second loss and the third loss. When the second loss is large, the parameters of the encoding model may be adjusted by gradient descent; when the third loss is large, the parameters of the decoding model may be adjusted by gradient descent, making the adjusted models more accurate, so that the second sample feature vector obtained through them is more similar to the first sample feature vector.
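The two training losses can be computed together as follows; the squared-error reconstruction term is a stand-in for the cross-entropy third loss the text mentions, and the closed-form KL term is an assumption:

```python
import numpy as np

def vae_losses(x, x_rec, mu, sigma):
    """Second loss: KL divergence from N(mu, diag(sigma^2)) to N(0, I).
    Third loss: reconstruction error between the first sample feature
    vector x and the decoded second sample feature vector x_rec
    (squared error here, standing in for a cross-entropy loss)."""
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - np.log(sigma ** 2))
    rec = np.mean((x - x_rec) ** 2)
    return kl, rec

# Perfect reconstruction with a standard-normal latent gives zero loss.
kl, rec = vae_losses(np.ones(4), np.ones(4), np.zeros(4), np.ones(4))
```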
If the first sample speech signal used to train the first model is a recorded-and-replayed speech signal, the first sample speech signal may be the first speech signal or the second speech signal.
Here, the first and second speech signals are both recorded-and-replayed speech signals: the first speech signal corresponds to the first sample feature vector input into the encoding model, and the second speech signal corresponds to the second sample feature vector output from the decoding model. Optionally, the recorded-and-replayed speech signals may include a transcribed speech signal, i.e., a speech signal obtained by transcribing a recorded-and-replayed speech signal through the encoding and decoding models, namely the second speech signal. When training the first model, original or recorded-and-replayed speech signals may be used, the latter also including transcribed speech signals; the target category obtained by inputting the statistical feature vector of a transcribed speech signal into the first model is the recorded-and-replayed speech signal.
In the embodiments of this application, by using the encoding and decoding models to process the statistical feature vectors corresponding to sample speech signals, a large number of sample speech signals can be obtained quickly, more efficiently than recording them with recording equipment.
The method of the embodiments of this application has been described above; the apparatus of the embodiments is described below.
Referring to Figure 6, a schematic diagram of the composition of a speech signal processing apparatus provided by an embodiment of this application, the apparatus 60 includes: a first feature acquisition module 601 configured to obtain a first statistical feature vector corresponding to the speech signal to be processed, the first statistical feature vector representing the statistical value of the speech signal to be processed in each dimension of an M-dimensional feature space, M being an integer greater than 1.
The speech signal to be processed may be an original speech signal or a recorded-and-replayed speech signal. The original speech signal is a speech signal produced directly by a user speaking (that is, not recorded and replayed through recording or video equipment); the recorded-and-replayed speech signal may include a recording of a speech signal produced directly by a user, a speech signal synthesized by signal synthesis rather than produced directly by a user, and so on. In this technical solution, all speech signals other than original speech signals are called recorded-and-replayed speech signals.
The first statistical feature vector includes a first mean vector and/or a first standard deviation vector; the first mean vector represents the mean of the speech signal to be processed in each dimension of the M-dimensional feature space, and the first standard deviation vector represents the corresponding standard deviation in each dimension of the M-dimensional feature space.
A second feature acquisition module 602 is configured to input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, the first model processing the first statistical feature vector according to the importance of each dimension of the M-dimensional feature space.
Specifically, the second feature acquisition module 602 may process the first statistical feature vector through a weight module in the first model to obtain the second statistical feature vector. The weight module includes a target weight matrix, which may be a matrix with an M-dimensional feature space, where the value corresponding to each dimension represents the importance of that dimension: the larger the value, the more important the dimension; the smaller the value, the less important the dimension.
Optionally, the target weight matrix may be obtained by assigning a weight to each dimension according to a target rule based on the magnitude of the mean of each dimension of the first mean vector; that is, the target weight matrix is a matrix with an M-dimensional feature space. The target rule may be that a dimension with a larger mean in the first mean vector receives a larger weight, and a dimension with a smaller mean receives a smaller weight.
It should be noted that the first model may be trained in advance so that the second statistical feature vector obtained with the trained model represents the category of the speech signal to be processed more accurately; for the training process, refer to the description of the embodiment corresponding to Figure 3, which is not repeated here.
A target category determination module 603 is configured to determine, according to the second statistical feature vector, the target category of the speech signal to be processed, the target category including an original speech signal or a recorded-and-replayed speech signal.
Here, the original speech signal is a speech signal produced directly by a user speaking (that is, not recorded and replayed through recording or video equipment); the recorded-and-replayed speech signal may include a recording of such a signal, a signal obtained by signal synthesis rather than produced directly by a user, and so on.
In one possible design, the first feature acquisition module 601 is configured to divide the speech signal to be processed into N speech frames, N being an integer greater than or equal to 1.
Specifically, the first feature acquisition module 601 may sample the speech signal to be processed at a preset sampling period, converting the continuous signal into a discrete signal; the sampling period may be determined according to the Nyquist sampling theorem. The discrete signal is then filtered by a digital filter with transfer function H(Z)=1-αZ^-1 to increase its high-frequency resolution, where α is a pre-emphasis coefficient greater than 0.9 and less than 1. Finally, the discrete signal may be divided into frames with a window function to obtain the N speech frames, where the window function may be a rectangular, Hamming, or Hanning window.
Optionally, the first feature acquisition module 601 may also remove noise and interference from the speech frames through endpoint detection, which may be based on energy, information entropy, or frequency-band variance, among other methods.
The module then obtains the first feature vector of each of the N speech frames, the first feature vector representing the feature value of the speech frame in each dimension of the M-dimensional feature space. For example, if M is 400 and N is 100, 100 speech frames are obtained, and the first feature vector of each frame is a feature vector with a 400-dimensional feature space, giving 100 400-dimensional first feature vectors.
Specifically, the first feature acquisition module 601 may extract linear prediction cepstral coefficients (LPCC), Mel-scale frequency cepstral coefficients (MFCC), or constant Q cepstral coefficients (CQCC) from each of the N speech frames to obtain the first feature vector. Taking CQCC extraction as an example, the module may first apply a constant Q transform (CQT) to the speech signal of each frame to convert the time-domain signal into the frequency domain; next, compute the energy spectrum of each frame and take its logarithm to obtain the log energy spectrum; finally, uniformly resample the log energy spectrum to obtain a sampled function and apply a discrete cosine transform (DCT) to it to obtain the CQCC feature vector, i.e., the first feature vector of each of the N frames.
For each dimension of the M-dimensional feature space, the module computes the statistical value corresponding to that dimension, namely the statistical value of the N speech frames in that dimension. For example, if M is 400 and N is 100, the statistical value of the 100 frames is computed for each of the 400 dimensions. The statistical value may include a mean and/or a standard deviation: computing the mean of the N frames in each dimension gives an M-dimensional mean vector, and computing the standard deviation gives an M-dimensional standard deviation vector; for example, 400-dimensional mean and/or standard deviation vectors over 100 frames.
According to the statistical value of each dimension of the M-dimensional feature space, the module constructs the first statistical feature vector corresponding to the speech signal to be processed. The first statistical feature vector includes the first mean vector and/or the first standard deviation vector, representing respectively the mean and the standard deviation of the speech signal to be processed in each dimension of the M-dimensional feature space. When the statistical values include both mean and standard deviation, the module constructs the first mean vector from the per-dimension means and the first standard deviation vector from the per-dimension standard deviations; the first mean vector is thus a vector of M means with an M-dimensional feature space, and the first standard deviation vector a vector of M standard deviations with an M-dimensional feature space.
In one possible design, the first statistical feature vector includes a first mean vector and/or a first standard deviation vector, representing respectively the mean and the standard deviation of the speech signal to be processed in each dimension of the M-dimensional feature space.
In one possible design, if the first statistical feature vector includes the first mean vector and the first standard deviation vector, the second statistical feature vector includes a second mean vector and a second standard deviation vector, obtained respectively from the first mean vector and the first model and from the first standard deviation vector and the first model. In a specific implementation, the second mean vector may be the product of the first mean vector and the target weight matrix, and the second standard deviation vector may be the product of the first standard deviation vector and the target weight matrix.
The target category determination module 603 is further configured to construct a third statistical feature vector from the second mean vector and the second standard deviation vector. In a specific implementation, the module may concatenate the second mean vector and the second standard deviation vector; since both are vectors with M-dimensional feature spaces, the concatenated third statistical feature vector may be a vector with a 2M-dimensional feature space, i.e., a 2M-dimensional feature vector.
The target category determination module 603 is further configured to determine the target category of the speech signal to be processed according to the third statistical feature vector. Optionally, the module may reduce the third statistical feature vector through a dimension reduction module to a two-dimensional feature vector and determine the target category from it. In a specific implementation, a correspondence between two-dimensional feature vectors and speech signal categories may be set in advance, so that when the two-dimensional feature vector is obtained, the corresponding speech signal category, and thus the target category of the speech signal to be processed, is determined from that correspondence.
In one possible design, the apparatus 60 further includes a first model training module 604 configured to obtain a first sample statistical feature vector corresponding to a first sample speech signal, the vector representing the statistical value of the first sample speech signal in each dimension of the M-dimensional feature space, M being an integer greater than 1, and the first sample speech signal being a recorded-and-replayed speech signal or an original speech signal.
Here, the first sample speech signal is a speech signal prepared for training the first model; for example, it may be obtained by recording an original speech signal, or by recording a recorded-and-replayed speech signal.
Optionally, when obtaining the first sample speech signal, the first model training module 604 may determine its target category in advance, i.e., determine whether it is an original or a recorded-and-replayed speech signal before inputting it into the first model. For example, the target categories of first sample speech signals 1, 2, and 3 may be recorded in advance: if they are original, original, and recorded-and-replayed respectively, the pairs (signal 1, original), (signal 2, original), (signal 3, recorded-and-replayed), and so on, may be recorded.
In a specific implementation, for obtaining the first sample statistical feature vector, the first model training module 604 may refer to the method of obtaining the first statistical feature vector in step S101, which is not repeated here. The first sample statistical feature vector includes a first sample mean vector and/or a first sample standard deviation vector, representing respectively the mean and the standard deviation of the first sample speech signal in each dimension of the M-dimensional feature space.
The first model training module 604 is further configured to input the first sample statistical feature vector into the first model for processing to obtain a second sample statistical feature vector. If the first sample statistical feature vector includes the first sample mean vector and the first sample standard deviation vector, the second sample statistical feature vector includes a second sample mean vector and a second sample standard deviation vector, obtained from the first sample mean vector and the first model and from the first sample standard deviation vector and the first model respectively.
The process is illustrated in Figure 3, a schematic diagram of training the first model provided by an embodiment of this application: the first model training module 604 obtains the first sample statistical feature vector, inputs it into the first model, and performs weight calculation on it through the weight module of the first model to obtain the second sample statistical feature vector. Optionally, a third sample statistical feature vector may be obtained from the second sample statistical feature vector and reduced through the dimension reduction module to a two-dimensional sample feature vector corresponding to a target category. The weight module includes the target weight matrix representing the importance of each dimension of the M-dimensional feature space; that is, the weight module performs weight calculation on the first sample statistical feature vector according to the importance of each dimension to obtain the third sample statistical feature vector. The dimension reduction module may include a fully connected layer and reduces the amount of computation in training the first model, for example reducing a 2M-dimensional feature matrix to a two-dimensional one.
The first model training module 604 is further configured to calculate a first loss of the first model according to the second sample statistical feature vector, that is, according to the first sample mean vector and the first sample standard deviation vector. In a specific implementation, because the target category of the first sample speech signal is determined in advance, and the two-dimensional sample feature vector obtained by processing the first sample statistical feature vector through the first model and the dimension reduction module corresponds to a target category, the first loss is calculated from the similarity between the pre-determined target category and the target category corresponding to the two-dimensional sample feature vector: the higher the similarity, the smaller the first loss; the lower the similarity, the greater the first loss. The first loss may be a cross-entropy loss.
The first model training module 604 is further configured to train the first model according to the first loss. When the first loss is large, the module may adjust the first model, i.e., its weight module, by gradient descent, and may also adjust the dimension reduction module by gradient descent, making the trained parameters more accurate so that the second statistical feature vector obtained through the first model reflects the category of the first sample speech signal more accurately.
In one possible design, the apparatus 60 further includes a speech signal acquisition module 605 configured to obtain a first speech signal, the first speech signal being a recorded-and-replayed speech signal. Here, the speech signal acquisition module 605 may obtain the recorded-and-replayed speech signal by recording an original speech signal with recording equipment.
The speech signal acquisition module 605 is further configured to obtain a second feature vector of the first speech signal and input it into an encoding model for encoding processing to obtain a fourth statistical feature vector representing the statistical features of the first speech signal. The second feature vector may include an LPCC, MFCC, or CQCC feature vector, obtained by the corresponding feature extraction on the first speech signal; for CQCC extraction, refer to the description in step S101, which is not repeated here. The fourth statistical feature vector includes a third mean vector and a third standard deviation vector, representing respectively the mean and the standard deviation of the first speech signal in each dimension of the M-dimensional feature space.
The speech signal acquisition module 605 is further configured to construct a first implicit vector from the fourth statistical feature vector, i.e., from the third mean vector and the third standard deviation vector, and input the first implicit vector into a decoding model for decoding processing to obtain a third feature vector, where the similarity between a second speech signal generated from the third feature vector and the first speech signal satisfies a target condition. The target condition is that this similarity meets a similarity threshold, for example 80%, 90%, or 95%; that is, the first and second speech signals are highly similar. In this way, one recorded-and-replayed speech signal can be used to generate another highly similar one: if there are X recorded-and-replayed speech signals, 2X can be produced in this manner and used to train the first model.
Optionally, the encoding model and the decoding model may be the encoding layer and the decoding layer of a variational autoencoder (VAE). Before using them, the speech signal acquisition module 605 may train them in advance so that the trained models are more accurate and the obtained second speech signal is more similar to the first speech signal corresponding to the input second feature vector; for the training method, refer to Figure 5.
The speech signal acquisition module 605 is further configured such that, if the first sample speech signal is a recorded-and-replayed speech signal, the first sample speech signal may be the first speech signal or the second speech signal.
In a possible design, the apparatus 60 further includes:
a second model training module 606, configured to acquire a first sample feature vector corresponding to a second sample voice signal.
For the specific method by which the second model training module 606 acquires the first sample feature vector corresponding to the second sample voice signal, reference may be made to the method of acquiring the second feature vector of the first voice signal in step S302, which is not repeated here.
The second model training module 606 is further configured to input the first sample feature vector into the encoding model for encoding to obtain a third sample statistical feature vector, where the third sample statistical feature vector represents statistical features of the second sample voice signal.
Here, the third sample statistical feature vector includes a second sample mean vector and a second sample standard-deviation vector, where the second sample mean vector represents the mean of the second sample voice signal in each of the M feature-space dimensions, and the second sample standard-deviation vector represents the corresponding standard deviation. That is, the second model training module 606 inputs the first sample feature vector into the encoding model for encoding to obtain the second sample mean vector and the second sample standard-deviation vector.
The second model training module 606 is further configured to determine a second loss according to the third sample statistical feature vector and a standard normal distribution function.
Here, the second model training module 606 may first determine a first normal distribution function from the second sample mean vector and the second sample standard-deviation vector, and then determine the second loss from the degree of overlap between the first normal distribution function and the standard normal distribution function: the higher the overlap, the smaller the second loss; the lower the overlap, the larger the second loss. The second loss may be a divergence loss. The degree of overlap between the two distribution functions is the overlap between the curve of the first normal distribution function and the curve of the standard normal distribution function when plotted on the same coordinate axes.
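One common closed form for such a divergence loss (an assumption consistent with, but not mandated by, this embodiment) is the KL divergence between a diagonal normal N(mu_i, sigma_i^2) and the standard normal N(0, 1): KL = 0.5 * sum_i (mu_i^2 + sigma_i^2 - 1 - log sigma_i^2). It is zero exactly when the two distributions coincide and grows as their overlap shrinks, matching the monotonic behaviour described above.

```python
import math

def second_loss(mu, sigma):
    """KL divergence from N(mu, diag(sigma^2)) to the standard normal."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

perfect = second_loss([0.0, 0.0], [1.0, 1.0])   # distributions coincide
near = second_loss([0.1, 0.0], [1.1, 0.9])      # high overlap, small loss
far = second_loss([2.0, -1.0], [0.5, 2.0])      # low overlap, large loss
```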
The second model training module 606 is further configured to construct a first sample latent vector from the third sample statistical feature vector, and input the first sample latent vector into the decoding model for decoding to obtain a second sample feature vector.
In a specific implementation, the second model training module 606 may obtain the first sample latent vector by multiplying a draw from the standard normal distribution by the second sample standard-deviation vector and adding the second sample mean vector; that is, the first sample latent vector may be the product of the standard-normal draw and the second sample standard-deviation vector, plus the second sample mean vector. The second model training module 606 inputs the first sample latent vector into the decoding model for decoding to obtain the second sample feature vector. Here, the second sample feature vector is the feature vector reconstructed by passing the first sample feature vector through the encoding model and the decoding model.
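The latent-vector construction just described, z = epsilon * sigma + mu with epsilon drawn from the standard normal (the reparameterization used in variational autoencoders), can be sketched as follows. The decoder weights here are illustrative stand-ins, not a trained decoding model.

```python
import random

random.seed(1)
mu = [0.5, -0.2, 0.1]      # sample mean vector
sigma = [0.3, 0.8, 0.05]   # sample standard-deviation vector

eps = [random.gauss(0.0, 1.0) for _ in mu]           # standard-normal draw
z = [e * s + m for e, s, m in zip(eps, sigma, mu)]   # latent = draw * std + mean

W_dec = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]         # toy decoder weights
recon = [sum(w * v for w, v in zip(row, z)) for row in W_dec]
```

Keeping the randomness in `eps` rather than inside the encoder is what lets gradients flow through `mu` and `sigma` during training.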
The second model training module 606 is further configured to determine a third loss according to the first sample feature vector and the second sample feature vector.
Here, the second model training module 606 may determine the third loss from the similarity between the first sample feature vector and the second sample feature vector: the higher the similarity, the smaller the third loss; the lower the similarity, the larger the third loss. The third loss may be a cross-entropy loss.
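The third loss shrinks as the reconstructed second sample feature vector approaches the first sample feature vector. The embodiment names a cross-entropy loss; this sketch substitutes mean squared error as a stand-in reconstruction penalty with the same monotonic behaviour (higher similarity, smaller loss).

```python
def third_loss(x, x_recon):
    """Mean squared error between original and reconstructed feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, x_recon)) / len(x)

x = [0.2, 0.7, -0.3]
close = third_loss(x, [0.2, 0.7, -0.2])   # near-identical reconstruction
poor = third_loss(x, [1.0, -1.0, 0.5])    # poor reconstruction
```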
The second model training module 606 is further configured to train the encoding model and the decoding model according to the second loss and the third loss.
When the second loss is large, the second model training module 606 may adjust the parameters of the encoding model by gradient descent; when the third loss is large, it may adjust the parameters of the decoding model by gradient descent. The adjusted encoding and decoding models are more accurate, so the second sample feature vector they produce is more similar to the first sample feature vector.
Here, both the first voice signal and the second voice signal are replayed-recording voice signals: the first voice signal corresponds to the first sample feature vector input into the encoding model, and the second voice signal corresponds to the second sample feature vector output from the decoding model. Optionally, replayed-recording voice signals may include transcribed voice signals, a transcribed voice signal being one obtained by transcribing a replayed-recording voice signal through the encoding and decoding models, i.e., the second voice signal. When training the first model, original voice signals or replayed-recording voice signals may be used, and the latter also include transcribed voice signals; that is, the target category obtained by inputting the statistical feature vector corresponding to a transcribed voice signal into the first model is the replayed-recording voice signal category.
It should be noted that, for content not mentioned in the embodiment corresponding to FIG. 6, reference may be made to the description of the method embodiments, which is not repeated here.
In this embodiment of the present application, the first statistical feature vector corresponding to the voice signal to be processed is acquired and input into the first model to obtain the second statistical feature vector, and the target category of the voice signal to be processed is determined from the second statistical feature vector, thereby determining whether the signal is an original voice signal or a replayed-recording voice signal. Because the first statistical feature vector is processed according to the importance of each of the M feature-space dimensions, the statistical features in each dimension are strengthened and reflect the statistical characteristics of the voice signal to be processed more accurately, so the target category is determined accurately and the accuracy of replay detection is improved; moreover, the content of the voice signal to be processed need not be inspected, which improves detection efficiency and applicability. Because a large number of sample voice signals are used to train the first model, the trained first model is more accurate, and the replay-detection result is therefore more accurate. By processing the statistical feature vectors corresponding to sample voice signals with the encoding model and the decoding model, a large number of sample voice signals can be obtained quickly; compared with recording sample voice signals with a recording device, this method acquires a large number of sample voice signals more efficiently.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a voice signal processing device provided by an embodiment of the present application. The device 70 includes a processor 701, a memory 702, and an input/output interface 703. The processor 701 is connected to the memory 702 and the input/output interface 703, for example via a bus.
The processor 701 is configured to support the voice signal processing device in performing the corresponding functions of the voice signal processing methods described in FIG. 1, FIG. 2, and FIG. 4. The processor 701 may be a central processing unit (CPU), a network processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 702 is configured to store program code and the like. The memory 702 may include volatile memory (VM), such as random access memory (RAM); the memory 702 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 702 may also include a combination of the above types of memory.
The input/output interface 703 is configured to input or output data.
The processor 701 may invoke the program code to perform the following operations: acquiring a first statistical feature vector corresponding to a voice signal to be processed, where the first statistical feature vector represents a statistical value of the voice signal to be processed in each of M feature-space dimensions, M being an integer greater than 1; inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, where the first model processes the first statistical feature vector according to the importance of each of the M feature-space dimensions; and determining, according to the second statistical feature vector, a target category of the voice signal to be processed, where the target category includes an original voice signal or a replayed-recording voice signal.
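The per-dimension statistics behind the first statistical feature vector can be sketched as follows (illustrative only): for N frame-level feature vectors of dimension M, take the mean and standard deviation of each dimension over the N frames and concatenate them into one vector.

```python
import math

def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over frames."""
    n = len(frames)
    m = len(frames[0])
    means = [sum(f[j] for f in frames) / n for j in range(m)]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in frames) / n)
            for j in range(m)]
    return means + stds   # first mean vector followed by first std-dev vector

frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]   # N = 3 frames, M = 2 dimensions
vec = statistics_pooling(frames)
```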
It should be noted that the implementation of each operation may also correspond to the corresponding description of the above method embodiments; the processor 701 may also cooperate with the input/output interface 703 to perform other operations in the above method embodiments.
An embodiment of the present application further provides a computer storage medium storing a computer program, the computer program including program instructions that, when executed by a computer, cause the computer to perform the method described in the foregoing embodiments. The computer may be a part of the above-mentioned voice signal processing device, for example the processor 701. The computer-readable storage medium may be non-volatile or volatile.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voice signal processing method, comprising:
    acquiring a first statistical feature vector corresponding to a voice signal to be processed, wherein the first statistical feature vector represents a statistical value of the voice signal to be processed in each of M feature-space dimensions, M being an integer greater than 1;
    inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, wherein the first model is configured to process the first statistical feature vector according to an importance of each of the M feature-space dimensions; and
    determining, according to the second statistical feature vector, a target category of the voice signal to be processed, wherein the target category comprises an original voice signal or a replayed-recording voice signal.
  2. The method according to claim 1, wherein the acquiring the first statistical feature vector corresponding to the voice signal to be processed comprises:
    dividing the voice signal to be processed into N voice frames, N being an integer greater than or equal to 1;
    acquiring a first feature vector of each of the N voice frames, wherein the first feature vector represents a feature value of the voice frame in each of the M feature-space dimensions;
    for each of the M feature-space dimensions, computing a statistical value corresponding to that dimension, the statistical value being the statistical value of the N voice frames in that dimension; and
    constructing, from the statistical value corresponding to each of the M feature-space dimensions, the first statistical feature vector corresponding to the voice signal to be processed.
  3. The method according to claim 1 or 2, wherein the first statistical feature vector comprises a first mean vector and/or a first standard-deviation vector, the first mean vector representing a mean of the voice signal to be processed in each of the M feature-space dimensions, and the first standard-deviation vector representing a standard deviation of the voice signal to be processed in each of the M feature-space dimensions.
  4. The method according to claim 3, wherein, if the first statistical feature vector comprises the first mean vector and the first standard-deviation vector, the second statistical feature vector comprises a second mean vector and a second standard-deviation vector, the second mean vector being obtained from the first mean vector and the first model, and the second standard-deviation vector being obtained from the first standard-deviation vector and the first model; and
    the determining, according to the second statistical feature vector, the target category of the voice signal to be processed comprises:
    constructing a third statistical feature vector from the second mean vector and the second standard-deviation vector; and
    determining the target category of the voice signal to be processed according to the third statistical feature vector.
  5. The method according to claim 1, wherein, before the acquiring the first statistical feature vector corresponding to the voice signal to be processed, the method further comprises:
    acquiring a first sample statistical feature vector corresponding to a first sample voice signal, wherein the first sample statistical feature vector represents a statistical value of the first sample voice signal in each of the M feature-space dimensions, M being an integer greater than 1, and the first sample voice signal is a replayed-recording voice signal or an original voice signal;
    inputting the first sample statistical feature vector into the first model for processing to obtain a second sample statistical feature vector;
    computing a first loss of the first model according to the second sample statistical feature vector; and
    training the first model according to the first loss.
  6. The method according to claim 5, wherein, before the acquiring the first statistical feature vector corresponding to the voice signal to be processed, the method further comprises:
    acquiring a first voice signal, the first voice signal being a replayed-recording voice signal;
    acquiring a second feature vector of the first voice signal, and inputting the second feature vector into an encoding model for encoding to obtain a fourth statistical feature vector, wherein the fourth statistical feature vector represents statistical features of the first voice signal; and
    constructing a first latent vector from the fourth statistical feature vector, and inputting the first latent vector into a decoding model for decoding to obtain a third feature vector, wherein a similarity between a second voice signal generated from the third feature vector and the first voice signal satisfies a target condition;
    wherein, if the first sample voice signal is a replayed-recording voice signal, the first sample voice signal is the first voice signal or the second voice signal.
  7. The method according to claim 6, wherein, before the acquiring the first voice signal, the method further comprises:
    acquiring a first sample feature vector corresponding to a second sample voice signal;
    inputting the first sample feature vector into the encoding model for encoding to obtain a third sample statistical feature vector, wherein the third sample statistical feature vector represents statistical features of the second sample voice signal;
    determining a second loss according to the third sample statistical feature vector and a standard normal distribution function;
    constructing a first sample latent vector from the third sample statistical feature vector, and inputting the first sample latent vector into the decoding model for decoding to obtain a second sample feature vector;
    determining a third loss according to the first sample feature vector and the second sample feature vector; and
    training the encoding model and the decoding model according to the second loss and the third loss.
  8. A voice signal processing apparatus, comprising:
    a first feature acquisition module, configured to acquire a first statistical feature vector corresponding to a voice signal to be processed, wherein the first statistical feature vector represents a statistical value of the voice signal to be processed in each of M feature-space dimensions, M being an integer greater than 1;
    a second feature acquisition module, configured to input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, wherein the first model is configured to process the first statistical feature vector according to an importance of each of the M feature-space dimensions; and
    a target category determination module, configured to determine, according to the second statistical feature vector, a target category of the voice signal to be processed, wherein the target category comprises an original voice signal or a replayed-recording voice signal.
  9. A voice signal processing device, comprising a processor, a memory, and an input/output interface connected to one another, wherein the input/output interface is configured to input or output data, the memory is configured to store program code, and the processor is configured to invoke the program code to:
    acquire a first statistical feature vector corresponding to a voice signal to be processed, wherein the first statistical feature vector represents a statistical value of the voice signal to be processed in each of M feature-space dimensions, M being an integer greater than 1;
    input the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, wherein the first model is configured to process the first statistical feature vector according to an importance of each of the M feature-space dimensions; and
    determine, according to the second statistical feature vector, a target category of the voice signal to be processed, wherein the target category comprises an original voice signal or a replayed-recording voice signal.
  10. The voice signal processing device according to claim 9, wherein the processor is configured to:
    divide the voice signal to be processed into N voice frames, N being an integer greater than or equal to 1;
    acquire a first feature vector of each of the N voice frames, wherein the first feature vector represents a feature value of the voice frame in each of the M feature-space dimensions;
    for each of the M feature-space dimensions, compute a statistical value corresponding to that dimension, the statistical value being the statistical value of the N voice frames in that dimension; and
    construct, from the statistical value corresponding to each of the M feature-space dimensions, the first statistical feature vector corresponding to the voice signal to be processed.
  11. The voice signal processing device according to claim 9 or 10, wherein the first statistical feature vector comprises a first mean vector and/or a first standard-deviation vector, the first mean vector representing a mean of the voice signal to be processed in each of the M feature-space dimensions, and the first standard-deviation vector representing a standard deviation of the voice signal to be processed in each of the M feature-space dimensions.
  12. The voice signal processing device according to claim 11, wherein, if the first statistical feature vector comprises the first mean vector and the first standard-deviation vector, the second statistical feature vector comprises a second mean vector and a second standard-deviation vector, the second mean vector being obtained from the first mean vector and the first model, and the second standard-deviation vector being obtained from the first standard-deviation vector and the first model; and
    the processor is configured to:
    construct a third statistical feature vector from the second mean vector and the second standard-deviation vector; and
    determine the target category of the voice signal to be processed according to the third statistical feature vector.
  13. The voice signal processing device according to claim 9, wherein the processor is configured to:
    acquire a first sample statistical feature vector corresponding to a first sample voice signal, wherein the first sample statistical feature vector represents a statistical value of the first sample voice signal in each of the M feature-space dimensions, M being an integer greater than 1, and the first sample voice signal is a replayed-recording voice signal or an original voice signal;
    input the first sample statistical feature vector into the first model for processing to obtain a second sample statistical feature vector;
    compute a first loss of the first model according to the second sample statistical feature vector; and
    train the first model according to the first loss.
  14. The voice signal processing device according to claim 13, wherein the processor is configured to:
    acquire a first voice signal, the first voice signal being a replayed-recording voice signal;
    acquire a second feature vector of the first voice signal, and input the second feature vector into an encoding model for encoding to obtain a fourth statistical feature vector, wherein the fourth statistical feature vector represents statistical features of the first voice signal; and
    construct a first latent vector from the fourth statistical feature vector, and input the first latent vector into a decoding model for decoding to obtain a third feature vector, wherein a similarity between a second voice signal generated from the third feature vector and the first voice signal satisfies a target condition;
    wherein, if the first sample voice signal is a replayed-recording voice signal, the first sample voice signal is the first voice signal or the second voice signal.
  15. The voice signal processing device according to claim 14, wherein the processor is configured to:
    acquire a first sample feature vector corresponding to a second sample voice signal;
    input the first sample feature vector into the encoding model for encoding to obtain a third sample statistical feature vector, wherein the third sample statistical feature vector represents statistical features of the second sample voice signal;
    determine a second loss according to the third sample statistical feature vector and a standard normal distribution function;
    construct a first sample latent vector from the third sample statistical feature vector, and input the first sample latent vector into the decoding model for decoding to obtain a second sample feature vector;
    determine a third loss according to the first sample feature vector and the second sample feature vector; and
    train the encoding model and the decoding model according to the second loss and the third loss.
  16. A computer storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the following steps:
    acquiring a first statistical feature vector corresponding to a voice signal to be processed, wherein the first statistical feature vector represents a statistical value of the voice signal to be processed in each of M feature-space dimensions, M being an integer greater than 1;
    inputting the first statistical feature vector into a first model for processing to obtain a second statistical feature vector, wherein the first model is configured to process the first statistical feature vector according to an importance of each of the M feature-space dimensions; and
    determining, according to the second statistical feature vector, a target category of the voice signal to be processed, wherein the target category comprises an original voice signal or a replayed-recording voice signal.
  17. The computer storage medium according to claim 16, wherein the program instructions, when executed by the processor, further cause the processor to perform the following steps:
    dividing the voice signal to be processed into N voice frames, N being an integer greater than or equal to 1;
    acquiring a first feature vector of each of the N voice frames, wherein the first feature vector represents a feature value of the voice frame in each of the M feature-space dimensions;
    for each of the M feature-space dimensions, computing a statistical value corresponding to that dimension, the statistical value being the statistical value of the N voice frames in that dimension; and
    constructing, from the statistical value corresponding to each of the M feature-space dimensions, the first statistical feature vector corresponding to the voice signal to be processed.
  18. The computer storage medium according to claim 16 or 17, wherein the first statistical feature vector comprises a first mean vector and/or a first standard-deviation vector, the first mean vector representing a mean of the voice signal to be processed in each of the M feature-space dimensions, and the first standard-deviation vector representing a standard deviation of the voice signal to be processed in each of the M feature-space dimensions.
  19. The computer storage medium according to claim 18, wherein, if the first statistical feature vector comprises the first mean vector and the first standard-deviation vector, the second statistical feature vector comprises a second mean vector and a second standard-deviation vector, the second mean vector being obtained from the first mean vector and the first model, and the second standard-deviation vector being obtained from the first standard-deviation vector and the first model; and
    the program instructions, when executed by the processor, further cause the processor to perform the following steps:
    constructing a third statistical feature vector from the second mean vector and the second standard-deviation vector; and
    determining the target category of the voice signal to be processed according to the third statistical feature vector.
  20. The computer storage medium according to claim 16, wherein the program instructions, when executed by the processor, further cause the processor to perform the following steps:
    acquiring a first sample statistical feature vector corresponding to a first sample voice signal, wherein the first sample statistical feature vector represents a statistical value of the first sample voice signal in each of the M feature-space dimensions, M being an integer greater than 1, and the first sample voice signal is a replayed-recording voice signal or an original voice signal;
    inputting the first sample statistical feature vector into the first model for processing to obtain a second sample statistical feature vector;
    computing a first loss of the first model according to the second sample statistical feature vector; and
    training the first model according to the first loss.
PCT/CN2020/118120 2020-02-17 2020-09-27 Voice signal processing method, apparatus, and device WO2021164256A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010096100.4A CN111292754A (zh) 2020-02-17 2020-02-17 Voice signal processing method, apparatus, and device
CN202010096100.4 2020-02-17

Publications (1)

Publication Number Publication Date
WO2021164256A1 true WO2021164256A1 (zh) 2021-08-26

Family

ID=71030044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118120 WO2021164256A1 (zh) 2020-02-17 2020-09-27 Voice signal processing method, apparatus, and device

Country Status (2)

Country Link
CN (1) CN111292754A (zh)
WO (1) WO2021164256A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292754A (zh) * 2020-02-17 2020-06-16 平安科技(深圳)有限公司 语音信号处理方法、装置及设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364656A (zh) * 2018-03-08 2018-08-03 北京得意音通技术有限责任公司 Feature extraction method and apparatus for voice replay detection
WO2018160943A1 (en) * 2017-03-03 2018-09-07 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN108711436A (zh) * 2018-05-17 2018-10-26 哈尔滨工业大学 Replay attack detection method for speaker verification systems based on high-frequency and bottleneck features
CN110136693A (zh) * 2018-02-09 2019-08-16 百度(美国)有限责任公司 Systems and methods for neural voice cloning with a few samples
CN110232927A (zh) * 2019-06-13 2019-09-13 苏州思必驰信息科技有限公司 Speaker verification anti-spoofing method and apparatus
CN110491391A (zh) * 2019-07-02 2019-11-22 厦门大学 Spoofed speech detection method based on deep neural networks
CN111292754A (zh) * 2020-02-17 2020-06-16 平安科技(深圳)有限公司 Voice signal processing method, apparatus, and device


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Muhammad Jalaluddin Akbar: "A Overview of Spoof Speech Detection for Automatic Speaker Verification", ResearchGate, 28 February 2019, XP055796300 *
Koji Okabe, Takafumi Koshinaka, Koichi Shinoda: "Attentive Statistics Pooling for Deep Speaker Embedding", arXiv.org, Cornell University Library, 29 March 2018, XP080860268 *
Mari Ganesh Kumar, Suvidha Rupesh Kumar, Saranya M, B. Bharathi, Hema A. Murthy: "Spoof detection using x-vector and feature switching", arXiv.org, Cornell University Library, 16 April 2019, XP081169801 *
Francis Tom, Mohit Jain, Prasenjit Dey: "End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention", Interspeech 2018, ISCA, pages 681-685, XP055839829, DOI: 10.21437/Interspeech.2018-2279 *
Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Hitoshi Yamamoto, Takafumi Koshinaka: "Attention Mechanism in Speaker Recognition: What Does it Learn in Deep Speaker Embedding?", 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, 18 December 2018, pages 1052-1059, XP033517000, DOI: 10.1109/SLT.2018.8639586 *

Also Published As

Publication number Publication date
CN111292754A (zh) 2020-06-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20919639

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20919639

Country of ref document: EP

Kind code of ref document: A1