CN114267363A - Voice countercheck sample generation method and device, electronic equipment and storage medium - Google Patents

Voice countercheck sample generation method and device, electronic equipment and storage medium

Info

Publication number
CN114267363A
Authority
CN
China
Prior art keywords
acoustic parameter
sequence
matrix
vector
multidimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210201797.6A
Other languages
Chinese (zh)
Other versions
CN114267363B (en)
Inventor
Fu Ruibo (傅睿博)
Tao Jianhua (陶建华)
Yi Jiangyan (易江燕)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210201797.6A priority Critical patent/CN114267363B/en
Publication of CN114267363A publication Critical patent/CN114267363A/en
Application granted granted Critical
Publication of CN114267363B publication Critical patent/CN114267363B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present disclosure relates to a method and an apparatus for generating a voice countermeasure sample, an electronic device and a storage medium. The method comprises: receiving a target text and extracting a text feature sequence from the target text; inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence; and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the countermeasure sample corresponding to the target text. Because the output of the acoustic model is a multi-dimensional acoustic parameter sequence, the generated speech content can maintain a high similarity (matching degree) to real speech under the description of multiple acoustic feature dimensions.

Description

Voice countercheck sample generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of voice technologies, and in particular, to a method and an apparatus for generating a voice countermeasure sample, an electronic device, and a storage medium.
Background
At present, in order to capture more discriminative information, speech generation detection models process the speech signal with a plurality of acoustic features, which are either fed directly into the model or used as a decision criterion. When a voice countermeasure sample is generated, however, the speech synthesis model usually selects only one type of acoustic feature for acoustic model modeling and reconstructs that parameter into a speech waveform with the vocoder. When the acoustic parameter adopted by the speech synthesis model is inconsistent with the acoustic parameters used by the speech generation detection model, the detection features of the generated speech differ greatly from those of real speech, so the generated speech is very easily detected by the speech generation detection model and the speech generation detection system cannot be deceived.
In addition, in the prior art a voice countermeasure sample is mainly generated by adding random perturbation within an error threshold, clipping the error, and the like, which amounts to passively adding a countermeasure perturbation. Although this can deceive a speech generation detection model to a certain extent, the added noise easily degrades the listening quality of the generated speech, making it easy to recognize and detect from a subjective, human perspective. Moreover, because such methods are not based on the speech generation detection mechanism, the generated countermeasure samples are too limited and can only effectively deceive some given speech generation detection models.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, embodiments of the present disclosure provide a method and an apparatus for generating a voice countermeasure sample, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating a speech confrontation sample, including the following steps:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
In a possible implementation manner, the acoustic model includes a backbone network, a self-attention mechanism layer, and a full connection layer, and the inputting the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence includes:
inputting the text characteristic sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
inputting the intermediate multidimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a full connection layer to obtain a multidimensional acoustic parameter sequence.
In a possible implementation, the inputting the intermediate multidimensional acoustic parameter sequence into the self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix includes:
multiplying each vector $h_i$ of the intermediate multidimensional acoustic parameter sequence $H = \{h_1, h_2, \ldots, h_N\}$ by three weight coefficients $W^{Q}$, $W^{K}$ and $W^{V}$ respectively to obtain three vectors:
$q_i = W^{Q} h_i,\quad k_i = W^{K} h_i,\quad v_i = W^{V} h_i;$
forming a matrix $Q$ from the vectors $q_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $K$ from the vectors $k_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $V$ from the vectors $v_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence, as the intermediate multidimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix $Q$ and the matrix $K$:
$a_{i,j} = q_i \cdot k_j$
wherein $a_{i,j}$ is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence, $q_i$ is the vector obtained by multiplying the i-th vector $h_i$ by $W^{Q}$, and $k_j$ is the vector obtained by multiplying the j-th vector $h_j$ by $W^{K}$;
and normalizing the matrix $A$ formed by the elements $a_{i,j}$ to obtain a matrix $\hat{A}$ as the vector correlation matrix.
In a possible embodiment, the inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into the fully-connected layer to obtain a multidimensional acoustic parameter sequence includes:
$Y = \mathrm{FCN}(\hat{A} V)$
wherein $Y$ is the multidimensional acoustic parameter sequence, the matrix $V$ is the intermediate multidimensional acoustic parameter matrix, the matrix $\hat{A}$ is the vector correlation matrix, and $\mathrm{FCN}$ denotes the fully connected layer.
In one possible embodiment, the vocoder model is trained by:
and taking the multi-dimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model to obtain a vocoder model.
In one possible implementation, the acoustic model is trained by the following expression:
$L = \sum_{k}\sum_{q} w_q \left\| y_{k,q} - \hat{y}_{k,q} \right\|_1$
wherein $L$ is the acoustic model training loss function, $w_q$ is the weight coefficient, $y_{k,q}$ is the true vector of the q-th class acoustic parameter of the k-th frame, and $\hat{y}_{k,q}$ is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame.
In a possible implementation manner, the true vector of the q-th class acoustic parameter of the k-th frame is obtained by the following steps:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different types of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring the real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
In a second aspect, an embodiment of the present disclosure provides a speech confrontation sample generation apparatus, including:
the extraction module is used for receiving a target text and extracting a text feature sequence from the target text;
the input module is used for inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and the generating module is used for inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a confrontation sample corresponding to the target text.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the voice countermeasure sample generation method when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the above-mentioned voice countermeasure sample generation method.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have at least some or all of the following advantages:
The voice countermeasure sample generation method of the embodiments of the present disclosure receives a target text and extracts a text feature sequence from the target text; inputs the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence; and inputs the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the countermeasure sample corresponding to the target text. Because the output of the acoustic model is a multi-dimensional acoustic parameter sequence, the generated speech content can maintain a high similarity (matching degree) to real speech under the description of multiple acoustic feature dimensions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly introduced below; it is obvious that, for those skilled in the art, other drawings can also be obtained from these drawings without inventive effort.
FIG. 1 schematically illustrates a flow diagram of a method of generating a speech confrontation sample according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a method for generating a speech confrontation sample according to another embodiment of the present disclosure;
fig. 3 schematically shows a block diagram of a structure of a speech countermeasure sample generation apparatus according to an embodiment of the present disclosure; and
fig. 4 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a method for generating a speech confrontation sample, including the following steps:
s1, receiving a target text and extracting a text feature sequence from the target text;
in practical application, for the target text for which a countermeasure sample is to be generated, the text features used in conventional speech synthesis are obtained through text regularization, text-to-phoneme conversion, polyphone prediction, prosodic pause prediction and other predictions. For each phoneme $f$, the text feature $x_f$ contains a series of features helpful for acoustic modeling, such as phoneme information, tone information, part-of-speech information and prosodic pause information. Assuming that a sample text of the training corpus contains $N$ phoneme units, the text feature sequence after processing and quantization is defined as $X = \{x_1, x_2, \ldots, x_N\}$.
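For illustration only, the per-phoneme text features described above could be quantized and assembled into the sequence $X$ along the lines of the following sketch; the feature fields, vocabularies and example phonemes are hypothetical and not part of the original disclosure.

```python
import numpy as np

# Hypothetical vocabularies for the per-phoneme feature fields described above.
PHONEMES = {"sil": 0, "b": 1, "a1": 2, "m": 3, "a3": 4}   # phoneme identity
TONES = 6          # tone classes 0-5 (0 = neutral / non-tonal)
POS_TAGS = 16      # part-of-speech classes
PROSODY_LEVELS = 4 # prosodic pause levels (#0-#3)

def encode_phoneme(phoneme: str, tone: int, pos: int, pause: int) -> np.ndarray:
    """Quantize one phoneme's text feature x_f as a concatenation of one-hot fields."""
    x = np.zeros(len(PHONEMES) + TONES + POS_TAGS + PROSODY_LEVELS, dtype=np.float32)
    x[PHONEMES[phoneme]] = 1.0
    x[len(PHONEMES) + tone] = 1.0
    x[len(PHONEMES) + TONES + pos] = 1.0
    x[len(PHONEMES) + TONES + POS_TAGS + pause] = 1.0
    return x

# Text feature sequence X = {x_1, ..., x_N} for an N-phoneme target text.
X = np.stack([
    encode_phoneme("b", 1, 3, 0),
    encode_phoneme("a1", 1, 3, 1),
    encode_phoneme("m", 3, 5, 0),
    encode_phoneme("a3", 3, 5, 2),
])
print(X.shape)  # (N, feature_dim)
```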
S2, inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and S3, inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence (time-domain speech waveform) of the voice as the countermeasure sample corresponding to the target text, wherein the vocoder model may adopt a deep neural network structure similar to the WaveRNN structure.
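As a minimal orientation sketch of how steps S1 to S3 chain together (the component classes and their interfaces below are placeholder assumptions, not APIs defined by this disclosure):

```python
import torch

def generate_countermeasure_sample(target_text: str,
                                   frontend,        # S1: text analysis front end (hypothetical)
                                   acoustic_model,  # S2: pre-trained acoustic model
                                   vocoder):        # S3: pre-trained WaveRNN-like vocoder
    """Sketch of the S1 -> S2 -> S3 pipeline; all components are assumed pre-trained."""
    # S1: target text -> text feature sequence X (N phonemes x feature_dim).
    X = frontend.extract_features(target_text)       # torch.Tensor [N, D_text]

    # S2: text feature sequence -> multi-dimensional acoustic parameter sequence Y.
    with torch.no_grad():
        Y = acoustic_model(X.unsqueeze(0))            # [1, T_frames, D_acoustic]

    # S3: acoustic parameters -> time-domain sampling sequence (the countermeasure sample).
    with torch.no_grad():
        waveform = vocoder(Y)                         # [1, num_samples]
    return waveform.squeeze(0)
```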
Referring to fig. 2, in step S2, the acoustic model includes a backbone network, an attention mechanism layer, and a full connection layer, and the step of inputting the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence includes:
s21, inputting the text feature sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
in practical application, the backbone network may adopt a neural network structure of an encoder-decoder, wherein the encoder-decoder structure may apply various forms of deep neural networks such as LSTM, CNN, and the like.
S22, inputting the intermediate multidimensional acoustic parameter sequence into an attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and S23, inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into the full connection layer to obtain a multidimensional acoustic parameter sequence.
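For illustration only, the backbone / self-attention / fully connected structure of steps S21 to S23 above could be realized as in the following PyTorch sketch; the layer sizes and the choice of an LSTM encoder-decoder backbone are assumptions introduced here.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Sketch: backbone network + self-attention layer + fully connected layer (S21-S23)."""
    def __init__(self, text_dim: int, hidden_dim: int = 256, acoustic_dim: int = 80):
        super().__init__()
        # S21 backbone: an LSTM encoder-decoder, one of the structures mentioned above.
        self.encoder = nn.LSTM(text_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        # S22 self-attention projections W^Q, W^K, W^V.
        self.w_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_v = nn.Linear(hidden_dim, hidden_dim, bias=False)
        # S23 fully connected layer producing the multi-dimensional acoustic parameters.
        self.fcn = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # S21: intermediate multi-dimensional acoustic parameter sequence H.
        h, _ = self.encoder(x)
        h, _ = self.decoder(h)
        # S22: Q, K, V and the normalized vector correlation matrix A_hat.
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        a = torch.matmul(q, k.transpose(1, 2))        # a_ij = q_i . k_j
        a_hat = torch.softmax(a, dim=-1)              # normalization of A
        # S23: Y = FCN(A_hat V).
        return self.fcn(torch.matmul(a_hat, v))

# Usage: AcousticModel(text_dim=31)(torch.randn(1, 20, 31)) -> tensor of shape [1, 20, 80]
```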
In this embodiment, in step S22, the inputting the intermediate multidimensional acoustic parameter sequence into the self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix includes:
multiplying each vector $h_i$ of the intermediate multidimensional acoustic parameter sequence $H = \{h_1, h_2, \ldots, h_N\}$ by three weight coefficients $W^{Q}$, $W^{K}$ and $W^{V}$ respectively to obtain three vectors:
$q_i = W^{Q} h_i,\quad k_i = W^{K} h_i,\quad v_i = W^{V} h_i;$
forming a matrix $Q$ from the vectors $q_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $K$ from the vectors $k_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $V$ from the vectors $v_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence, as the intermediate multidimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix $Q$ and the matrix $K$:
$a_{i,j} = q_i \cdot k_j$
wherein $a_{i,j}$ is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence, $q_i$ is the vector obtained by multiplying the i-th vector $h_i$ by $W^{Q}$, and $k_j$ is the vector obtained by multiplying the j-th vector $h_j$ by $W^{K}$;
and normalizing the matrix $A$ formed by the elements $a_{i,j}$ to obtain a matrix $\hat{A}$ as the vector correlation matrix.
In this embodiment, in step S23, the inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into the full connection layer to obtain the multidimensional acoustic parameter sequence includes:
$Y = \mathrm{FCN}(\hat{A} V)$
wherein $Y$ is the multidimensional acoustic parameter sequence, the matrix $V$ is the intermediate multidimensional acoustic parameter matrix, the matrix $\hat{A}$ is the vector correlation matrix, and $\mathrm{FCN}$ denotes the fully connected layer.
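As a small numerical illustration of the formulas in steps S22 and S23 (a sketch only; the dimensions, random values and the use of a row-wise softmax as the normalization are assumptions introduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                       # N frames of an intermediate sequence H, each of dimension d
H = rng.standard_normal((N, d))   # intermediate multidimensional acoustic parameter sequence
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = H @ W_q, H @ W_k, H @ W_v   # rows are q_i, k_i, v_i; V is the intermediate matrix
A = Q @ K.T                            # a_ij = q_i . k_j (pairwise correlations)

# Row-wise normalization of A (numerically stable softmax) -> vector correlation matrix A_hat.
A_shift = A - A.max(axis=1, keepdims=True)
A_hat = np.exp(A_shift) / np.exp(A_shift).sum(axis=1, keepdims=True)

W_fcn = rng.standard_normal((d, 12))   # a single fully connected (FCN) projection
Y = (A_hat @ V) @ W_fcn                # Y = FCN(A_hat V): multidimensional acoustic parameters
print(Y.shape)                         # (N, 12)
```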
In this embodiment, in step S3, the vocoder model is obtained by training the following steps:
taking the multidimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model, thereby obtaining the vocoder model, wherein the acoustic model is trained through the following expression:
$L = \sum_{k}\sum_{q} w_q \left\| y_{k,q} - \hat{y}_{k,q} \right\|_1$
wherein $L$ is the acoustic model training loss function, $w_q$ is a weight coefficient for adjusting the degree of emphasis placed on each dimension of acoustic parameters according to the speech generation detection model, $y_{k,q}$ is the true vector of the q-th class acoustic parameter of the k-th frame, $\hat{y}_{k,q}$ is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame, and $\|\cdot\|_1$ denotes the norm of order 1, wherein the true vector of the q-th class acoustic parameter of the k-th frame is obtained by the following steps:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different types of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring the real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
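A rough sketch of how the frame-wise multi-dimensional acoustic parameter targets and the weighted L1 training loss described above might be computed is given below. The use of librosa for MFCC/LPC extraction, the frame sizes, the file path, and the weight values are assumptions introduced here; CQCC is omitted because librosa does not provide it directly, but it would be concatenated frame by frame in the same way.

```python
import numpy as np
import librosa
import torch

def true_multidim_params(wav_path: str, n_mfcc: int = 20, lpc_order: int = 12,
                         frame_length: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Extract two kinds of acoustic parameters from the real speech (MFCC + per-frame LPC)
    and splice them frame by frame into multi-dimensional acoustic parameter vectors."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)      # [n_mfcc, T]
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    n_frames = min(mfcc.shape[1], frames.shape[1])
    lpc = np.stack([librosa.lpc(frames[:, t], order=lpc_order)[1:]              # drop leading 1.0
                    for t in range(n_frames)], axis=1)                          # [lpc_order, T]
    return np.concatenate([mfcc[:, :n_frames], lpc], axis=0).T                  # [T, n_mfcc+lpc_order]

def weighted_l1_loss(pred: torch.Tensor, target: torch.Tensor,
                     dims: dict, weights: dict) -> torch.Tensor:
    """L = sum_k sum_q w_q * || y_{k,q} - y_hat_{k,q} ||_1 over frames k and parameter classes q.
    `dims` maps each parameter class q to its column slice; `weights` holds the w_q."""
    loss = pred.new_zeros(())
    for q, sl in dims.items():
        loss = loss + weights[q] * torch.abs(target[:, sl] - pred[:, sl]).sum()
    return loss

# Usage sketch (path, slices and weights are illustrative only):
# Y_true = torch.from_numpy(true_multidim_params("real_utterance.wav")).float()
# Y_pred = torch.randn_like(Y_true)
# loss = weighted_l1_loss(Y_pred, Y_true,
#                         dims={"mfcc": slice(0, 20), "lpc": slice(20, 32)},
#                         weights={"mfcc": 1.0, "lpc": 0.5})
```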
Unlike the conventional approach of adding a globally, relatively uniform countermeasure perturbation to an entire generated sentence, the voice countermeasure sample generation method of the present disclosure adds phoneme-related countermeasure information during speech generation according to the different phonemes being generated, so that voice countermeasure samples are generated in a more targeted manner. In the process of generating the countermeasure sample, the acoustic parameters commonly used by speech generation detection systems, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC), Linear Predictive Coefficients (LPC) and Constant-Q Cepstral Coefficients (CQCC), are comprehensively reconstructed, so that the generated speech is closer to real speech in parameter distribution; this constitutes an active attack means. The multi-dimensional parameter reconstruction method can deceive various speech generation detection models, and the generated countermeasure samples cannot be correctly identified by a speech generation detection system. Compared with the traditional method for generating countermeasure samples, the method is based on the mechanism of speech generation detection, adopts a phoneme-level fine-grained means to generate phoneme-related voice countermeasure samples for different generated speech contents, and uses the multi-dimensional parameter reconstruction method to improve the anti-detection capability of the voice countermeasure samples.
The voice countermeasure sample generation method of the present disclosure is based on the speech generation detection mechanism, improves the generation effect of voice countermeasure samples in an interpretable way, can deceive speech generation detection models more effectively, and is easy to operate and implement.
Referring to fig. 3, an embodiment of the present disclosure provides a voice confrontation sample generation apparatus, including:
the extraction module 11 is configured to receive a target text and extract a text feature sequence from the target text;
the input module 12 is configured to input the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and a generating module 13, configured to input the multidimensional acoustic parameter sequence into a pre-trained vocoder model, and generate a time-domain sampling sequence of speech as a countermeasure sample corresponding to the target text.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
In this embodiment, any plurality of the extraction module 11, the input module 12, and the generation module 13 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. At least one of the extraction module 11, the input module 12 and the generation module 13 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware and firmware, or in a suitable combination of any of them. Alternatively, at least one of the extraction module 11, the input module 12 and the generation module 13 may be at least partly implemented as a computer program module, which when executed may perform a corresponding function.
Referring to fig. 4, an electronic device provided by an embodiment of the present disclosure includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110, when executing the program stored in the memory 1130, implements the method for generating a speech countermeasure sample as follows:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
The communication bus 1140 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1130 may also be at least one storage device located remotely from the processor 1110.
The processor 1110 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for generating a speech countermeasure sample as described above.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The above-mentioned computer-readable storage medium carries one or more programs which, when executed, implement the speech countermeasure sample generation method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating a speech confrontation sample, comprising the steps of:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
2. The method of claim 1, wherein the acoustic model comprises a backbone network, an attention mechanism layer and a full connection layer, and the inputting the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence comprises:
inputting the text characteristic sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
inputting the intermediate multidimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a full connection layer to obtain a multidimensional acoustic parameter sequence.
3. The method of claim 2, wherein the inputting the sequence of intermediate multidimensional acoustic parameters into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix comprises:
multiplying each vector $h_i$ of the intermediate multidimensional acoustic parameter sequence $H = \{h_1, h_2, \ldots, h_N\}$ by three weight coefficients $W^{Q}$, $W^{K}$ and $W^{V}$ respectively to obtain three vectors:
$q_i = W^{Q} h_i,\quad k_i = W^{K} h_i,\quad v_i = W^{V} h_i;$
forming a matrix $Q$ from the vectors $q_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $K$ from the vectors $k_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $V$ from the vectors $v_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence, as the intermediate multidimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix $Q$ and the matrix $K$:
$a_{i,j} = q_i \cdot k_j$
wherein $a_{i,j}$ is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence, $q_i$ is the vector obtained by multiplying the i-th vector $h_i$ by $W^{Q}$, and $k_j$ is the vector obtained by multiplying the j-th vector $h_j$ by $W^{K}$;
and normalizing the matrix $A$ formed by the elements $a_{i,j}$ to obtain a matrix $\hat{A}$ as the vector correlation matrix.
4. The method of claim 3, wherein inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a fully connected layer to obtain a multidimensional acoustic parameter sequence comprises:
$Y = \mathrm{FCN}(\hat{A} V)$
wherein $Y$ is the multidimensional acoustic parameter sequence, the matrix $V$ is the intermediate multidimensional acoustic parameter matrix, the matrix $\hat{A}$ is the vector correlation matrix, and $\mathrm{FCN}$ denotes the fully connected layer.
5. The method of claim 1, wherein the vocoder model is trained by:
and taking the multi-dimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model to obtain a vocoder model.
6. The method of claim 1, wherein the acoustic model is trained by the following expression:
$L = \sum_{k}\sum_{q} w_q \left\| y_{k,q} - \hat{y}_{k,q} \right\|_1$
wherein $L$ is the acoustic model training loss function, $w_q$ is the weight coefficient, $y_{k,q}$ is the true vector of the q-th class acoustic parameter of the k-th frame, and $\hat{y}_{k,q}$ is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame.
7. The method according to claim 6, wherein the true vector of the q-th class acoustic parameter of the k-th frame is obtained by:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different types of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring the real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
8. A speech confrontation sample generation apparatus, comprising:
the extraction module is used for receiving a target text and extracting a text feature sequence from the target text;
the input module is used for inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and the generating module is used for inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a confrontation sample corresponding to the target text.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of generating a speech countermeasure sample according to any one of claims 1 to 7 when executing a program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the speech countermeasure sample generation method of any one of claims 1 to 7.
CN202210201797.6A 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium Active CN114267363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210201797.6A CN114267363B (en) 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210201797.6A CN114267363B (en) 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114267363A true CN114267363A (en) 2022-04-01
CN114267363B CN114267363B (en) 2022-05-24

Family

ID=80833816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210201797.6A Active CN114267363B (en) 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114267363B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859736A (en) * 2019-01-23 2019-06-07 北京光年无限科技有限公司 Phoneme synthesizing method and system
EP3599606A1 (en) * 2018-07-26 2020-01-29 Accenture Global Solutions Limited Machine learning for authenticating voice
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112786011A (en) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN113205792A (en) * 2021-04-08 2021-08-03 内蒙古工业大学 Mongolian speech synthesis method based on Transformer and WaveNet
CN114121010A (en) * 2021-11-30 2022-03-01 阿里巴巴(中国)有限公司 Model training, voice generation, voice interaction method, device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3599606A1 (en) * 2018-07-26 2020-01-29 Accenture Global Solutions Limited Machine learning for authenticating voice
CN109859736A (en) * 2019-01-23 2019-06-07 北京光年无限科技有限公司 Phoneme synthesizing method and system
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112786011A (en) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN113205792A (en) * 2021-04-08 2021-08-03 内蒙古工业大学 Mongolian speech synthesis method based on Transformer and WaveNet
CN114121010A (en) * 2021-11-30 2022-03-01 阿里巴巴(中国)有限公司 Model training, voice generation, voice interaction method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAIHAN LI et al.: "Neural speech synthesis with transformer network", Proceedings of the AAAI Conference on Artificial Intelligence *

Also Published As

Publication number Publication date
CN114267363B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
Balamurali et al. Toward robust audio spoofing detection: A detailed comparison of traditional and learned features
Patton et al. AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN111916111A (en) Intelligent voice outbound method and device with emotion, server and storage medium
JP2019215500A (en) Voice conversion learning device, voice conversion device, method, and program
Chatterjee et al. Auditory model-based design and optimization of feature vectors for automatic speech recognition
CN111968652A (en) Speaker identification method based on 3DCNN-LSTM and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN110648655A (en) Voice recognition method, device, system and storage medium
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review
CN114267363B (en) Voice countercheck sample generation method and device, electronic equipment and storage medium
Praveen et al. Text dependent speaker recognition using MFCC features and BPANN
Rao Accent classification from an emotional speech in clean and noisy environments
Nijhawan et al. Speaker recognition using support vector machine
Bawa et al. Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions
Bakshi et al. Spoken Indian language classification using GMM supervectors and artificial neural networks
Bhaskar et al. Analysis of language identification performance based on gender and hierarchial grouping approaches
Ehkan et al. Hardware implementation of MFCC-based feature extraction for speaker recognition
Harere et al. Mispronunciation detection of basic quranic recitation rules using deep learning
Nijhawan et al. Real time speaker recognition system for hindi words
Srinivas LFBNN: robust and hybrid training algorithm to neural network for hybrid features-enabled speaker recognition system
CN116705036B (en) Multi-level feature fusion-based phrase voice speaker recognition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant