CN114267363A - Voice countercheck sample generation method and device, electronic equipment and storage medium - Google Patents

Voice countercheck sample generation method and device, electronic equipment and storage medium

Info

Publication number
CN114267363A
Authority
CN
China
Prior art keywords
acoustic parameter
sequence
matrix
vector
multidimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210201797.6A
Other languages
Chinese (zh)
Other versions
CN114267363B (en)
Inventor
Fu Ruibo (傅睿博)
Tao Jianhua (陶建华)
Yi Jiangyan (易江燕)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210201797.6A priority Critical patent/CN114267363B/en
Publication of CN114267363A publication Critical patent/CN114267363A/en
Application granted granted Critical
Publication of CN114267363B publication Critical patent/CN114267363B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present disclosure relates to a method and an apparatus for generating a voice countermeasure sample, an electronic device and a storage medium. The method comprises: receiving a target text and extracting a text feature sequence from the target text; inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence; and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the countermeasure sample corresponding to the target text. Because the output of the acoustic model is a multi-dimensional acoustic parameter sequence, the generated speech content can maintain a high similarity (matching degree) to real speech under the description of multiple acoustic feature dimensions.

Description

Voice countercheck sample generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of voice technologies, and in particular, to a method and an apparatus for generating a voice countermeasure sample, an electronic device, and a storage medium.
Background
At present, in order to capture more discriminative information, speech generation detection models process the speech signal with a plurality of acoustic features, which are either fed directly into the model or used as a decision criterion. When a voice countermeasure sample is generated, however, the speech synthesis model usually selects only one type of acoustic feature for acoustic model modeling and reconstructs that parameter into a speech waveform with the vocoder. When the acoustic parameter adopted by the speech synthesis model is inconsistent with the acoustic parameters used by the speech generation detection model, the detection features of the generated speech differ greatly from those of real speech, so the generated speech is very easily detected by the speech generation detection model and the speech generation detection system cannot be deceived.
In addition, in the prior art a voice countermeasure sample is mainly generated by adding random perturbation within an error threshold, clipping the error, and the like, which amounts to passively adding a countermeasure perturbation. Although this can deceive a speech generation detection model to a certain extent, the added noise easily degrades the listening quality of the generated speech, making it easy to recognize and detect from a subjective, human perspective. Moreover, because such methods are not based on the speech generation detection mechanism, the generated countermeasure samples are too limited and can only effectively deceive some given speech generation detection models.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, embodiments of the present disclosure provide a method and an apparatus for generating a voice countermeasure sample, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating a speech confrontation sample, including the following steps:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
In a possible implementation manner, the acoustic model includes a backbone network, a self-attention mechanism layer, and a full connection layer, and the inputting the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence includes:
inputting the text characteristic sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
inputting the intermediate multidimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a full connection layer to obtain a multidimensional acoustic parameter sequence.
In a possible implementation, the inputting the intermediate multidimensional acoustic parameter sequence into the self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix includes:
multiplying each vector $h_i$ of the intermediate multidimensional acoustic parameter sequence $H = \{h_1, h_2, \ldots, h_N\}$ by three weight coefficients $W^{Q}$, $W^{K}$ and $W^{V}$ respectively to obtain three vectors:
$q_i = W^{Q} h_i,\quad k_i = W^{K} h_i,\quad v_i = W^{V} h_i;$
forming a matrix $Q$ from the vectors $q_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $K$ from the vectors $k_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $V$ from the vectors $v_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence, as the intermediate multidimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix $Q$ and the matrix $K$:
$a_{i,j} = q_i \cdot k_j$
wherein $a_{i,j}$ is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence, $q_i$ is the vector obtained by multiplying the i-th vector $h_i$ by $W^{Q}$, and $k_j$ is the vector obtained by multiplying the j-th vector $h_j$ by $W^{K}$;
and normalizing the matrix $A$ formed by the elements $a_{i,j}$ to obtain a matrix $\hat{A}$ as the vector correlation matrix.
In a possible embodiment, the inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into the fully-connected layer to obtain a multidimensional acoustic parameter sequence includes:
$Y = \mathrm{FCN}(\hat{A} V)$
wherein $Y$ is the multidimensional acoustic parameter sequence, the matrix $V$ is the intermediate multidimensional acoustic parameter matrix, the matrix $\hat{A}$ is the vector correlation matrix, and $\mathrm{FCN}$ denotes the fully connected layer.
In one possible embodiment, the vocoder model is trained by:
and taking the multi-dimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model to obtain a vocoder model.
In one possible implementation, the acoustic model is trained by the following expression:
$L = \sum_{k}\sum_{q} w_q \left\| y_{k,q} - \hat{y}_{k,q} \right\|_1$
wherein $L$ is the acoustic model training loss function, $w_q$ is the weight coefficient, $y_{k,q}$ is the true vector of the q-th class acoustic parameter of the k-th frame, and $\hat{y}_{k,q}$ is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame.
In a possible implementation manner, the true vector of the q-th class acoustic parameter of the k-th frame is obtained by the following steps:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different types of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring the real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
In a second aspect, an embodiment of the present disclosure provides a speech confrontation sample generation apparatus, including:
the extraction module is used for receiving a target text and extracting a text feature sequence from the target text;
the input module is used for inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and the generating module is used for inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a confrontation sample corresponding to the target text.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the voice countermeasure sample generation method when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the above-mentioned voice countermeasure sample generation method.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have at least some or all of the following advantages:
The voice countermeasure sample generation method of the embodiments of the present disclosure receives a target text and extracts a text feature sequence from the target text; inputs the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence; and inputs the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the countermeasure sample corresponding to the target text. Because the output of the acoustic model is a multi-dimensional acoustic parameter sequence, the generated speech content can maintain a high similarity (matching degree) to real speech under the description of multiple acoustic feature dimensions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly introduced below; it is obvious that, for those skilled in the art, other drawings can also be obtained from these drawings without inventive effort.
FIG. 1 schematically illustrates a flow diagram of a method of generating a speech confrontation sample according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a method for generating a speech confrontation sample according to another embodiment of the present disclosure;
fig. 3 schematically shows a block diagram of a structure of a speech countermeasure sample generation apparatus according to an embodiment of the present disclosure; and
fig. 4 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a method for generating a speech confrontation sample, including the following steps:
s1, receiving a target text and extracting a text feature sequence from the target text;
in practical application, for the target text for which a countermeasure sample is to be generated, the text features used in conventional speech synthesis are obtained through text regularization, text-to-phoneme conversion, polyphone prediction, prosodic pause prediction and other predictions. For each phoneme $f$, the text feature $x_f$ contains a series of features helpful for acoustic modeling, such as phoneme information, tone information, part-of-speech information and prosodic pause information. Assuming that a sample text of the training corpus contains $N$ phoneme units, the text feature sequence after processing and quantization is defined as $X = \{x_1, x_2, \ldots, x_N\}$.
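For illustration only, the per-phoneme text features described above could be quantized and assembled into the sequence $X$ along the lines of the following sketch; the feature fields, vocabularies and example phonemes are hypothetical and not part of the original disclosure.

```python
import numpy as np

# Hypothetical vocabularies for the per-phoneme feature fields described above.
PHONEMES = {"sil": 0, "b": 1, "a1": 2, "m": 3, "a3": 4}   # phoneme identity
TONES = 6          # tone classes 0-5 (0 = neutral / non-tonal)
POS_TAGS = 16      # part-of-speech classes
PROSODY_LEVELS = 4 # prosodic pause levels (#0-#3)

def encode_phoneme(phoneme: str, tone: int, pos: int, pause: int) -> np.ndarray:
    """Quantize one phoneme's text feature x_f as a concatenation of one-hot fields."""
    x = np.zeros(len(PHONEMES) + TONES + POS_TAGS + PROSODY_LEVELS, dtype=np.float32)
    x[PHONEMES[phoneme]] = 1.0
    x[len(PHONEMES) + tone] = 1.0
    x[len(PHONEMES) + TONES + pos] = 1.0
    x[len(PHONEMES) + TONES + POS_TAGS + pause] = 1.0
    return x

# Text feature sequence X = {x_1, ..., x_N} for an N-phoneme target text.
X = np.stack([
    encode_phoneme("b", 1, 3, 0),
    encode_phoneme("a1", 1, 3, 1),
    encode_phoneme("m", 3, 5, 0),
    encode_phoneme("a3", 3, 5, 2),
])
print(X.shape)  # (N, feature_dim)
```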
S2, inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and S3, inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence (time-domain speech waveform) of the voice as the countermeasure sample corresponding to the target text, wherein the vocoder model may adopt a deep neural network structure similar to the WaveRNN structure.
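As a minimal orientation sketch of how steps S1 to S3 chain together (the component classes and their interfaces below are placeholder assumptions, not APIs defined by this disclosure):

```python
import torch

def generate_countermeasure_sample(target_text: str,
                                   frontend,        # S1: text analysis front end (hypothetical)
                                   acoustic_model,  # S2: pre-trained acoustic model
                                   vocoder):        # S3: pre-trained WaveRNN-like vocoder
    """Sketch of the S1 -> S2 -> S3 pipeline; all components are assumed pre-trained."""
    # S1: target text -> text feature sequence X (N phonemes x feature_dim).
    X = frontend.extract_features(target_text)       # torch.Tensor [N, D_text]

    # S2: text feature sequence -> multi-dimensional acoustic parameter sequence Y.
    with torch.no_grad():
        Y = acoustic_model(X.unsqueeze(0))            # [1, T_frames, D_acoustic]

    # S3: acoustic parameters -> time-domain sampling sequence (the countermeasure sample).
    with torch.no_grad():
        waveform = vocoder(Y)                         # [1, num_samples]
    return waveform.squeeze(0)
```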
Referring to fig. 2, in step S2, the acoustic model includes a backbone network, an attention mechanism layer, and a full connection layer, and the step of inputting the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence includes:
s21, inputting the text feature sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
in practical application, the backbone network may adopt a neural network structure of an encoder-decoder, wherein the encoder-decoder structure may apply various forms of deep neural networks such as LSTM, CNN, and the like.
S22, inputting the intermediate multidimensional acoustic parameter sequence into an attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and S23, inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into the full connection layer to obtain a multidimensional acoustic parameter sequence.
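For illustration only, the backbone / self-attention / fully connected structure of steps S21 to S23 above could be realized as in the following PyTorch sketch; the layer sizes and the choice of an LSTM encoder-decoder backbone are assumptions introduced here.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Sketch: backbone network + self-attention layer + fully connected layer (S21-S23)."""
    def __init__(self, text_dim: int, hidden_dim: int = 256, acoustic_dim: int = 80):
        super().__init__()
        # S21 backbone: an LSTM encoder-decoder, one of the structures mentioned above.
        self.encoder = nn.LSTM(text_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        # S22 self-attention projections W^Q, W^K, W^V.
        self.w_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_v = nn.Linear(hidden_dim, hidden_dim, bias=False)
        # S23 fully connected layer producing the multi-dimensional acoustic parameters.
        self.fcn = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # S21: intermediate multi-dimensional acoustic parameter sequence H.
        h, _ = self.encoder(x)
        h, _ = self.decoder(h)
        # S22: Q, K, V and the normalized vector correlation matrix A_hat.
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        a = torch.matmul(q, k.transpose(1, 2))        # a_ij = q_i . k_j
        a_hat = torch.softmax(a, dim=-1)              # normalization of A
        # S23: Y = FCN(A_hat V).
        return self.fcn(torch.matmul(a_hat, v))

# Usage: AcousticModel(text_dim=31)(torch.randn(1, 20, 31)) -> tensor of shape [1, 20, 80]
```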
In this embodiment, in step S22, the inputting the intermediate multidimensional acoustic parameter sequence into the self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix includes:
multiplying each vector $h_i$ of the intermediate multidimensional acoustic parameter sequence $H = \{h_1, h_2, \ldots, h_N\}$ by three weight coefficients $W^{Q}$, $W^{K}$ and $W^{V}$ respectively to obtain three vectors:
$q_i = W^{Q} h_i,\quad k_i = W^{K} h_i,\quad v_i = W^{V} h_i;$
forming a matrix $Q$ from the vectors $q_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $K$ from the vectors $k_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $V$ from the vectors $v_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence, as the intermediate multidimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix $Q$ and the matrix $K$:
$a_{i,j} = q_i \cdot k_j$
wherein $a_{i,j}$ is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence, $q_i$ is the vector obtained by multiplying the i-th vector $h_i$ by $W^{Q}$, and $k_j$ is the vector obtained by multiplying the j-th vector $h_j$ by $W^{K}$;
and normalizing the matrix $A$ formed by the elements $a_{i,j}$ to obtain a matrix $\hat{A}$ as the vector correlation matrix.
In this embodiment, in step S23, the inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into the full connection layer to obtain the multidimensional acoustic parameter sequence includes:
$Y = \mathrm{FCN}(\hat{A} V)$
wherein $Y$ is the multidimensional acoustic parameter sequence, the matrix $V$ is the intermediate multidimensional acoustic parameter matrix, the matrix $\hat{A}$ is the vector correlation matrix, and $\mathrm{FCN}$ denotes the fully connected layer.
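As a small numerical illustration of the formulas in steps S22 and S23 (a sketch only; the dimensions, random values and the use of a row-wise softmax as the normalization are assumptions introduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                       # N frames of an intermediate sequence H, each of dimension d
H = rng.standard_normal((N, d))   # intermediate multidimensional acoustic parameter sequence
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

Q, K, V = H @ W_q, H @ W_k, H @ W_v   # rows are q_i, k_i, v_i; V is the intermediate matrix
A = Q @ K.T                            # a_ij = q_i . k_j (pairwise correlations)

# Row-wise normalization of A (numerically stable softmax) -> vector correlation matrix A_hat.
A_shift = A - A.max(axis=1, keepdims=True)
A_hat = np.exp(A_shift) / np.exp(A_shift).sum(axis=1, keepdims=True)

W_fcn = rng.standard_normal((d, 12))   # a single fully connected (FCN) projection
Y = (A_hat @ V) @ W_fcn                # Y = FCN(A_hat V): multidimensional acoustic parameters
print(Y.shape)                         # (N, 12)
```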
In this embodiment, in step S3, the vocoder model is obtained by training the following steps:
taking the multidimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model, thereby obtaining the vocoder model, wherein the acoustic model is trained through the following expression:
$L = \sum_{k}\sum_{q} w_q \left\| y_{k,q} - \hat{y}_{k,q} \right\|_1$
wherein $L$ is the acoustic model training loss function, $w_q$ is a weight coefficient for adjusting the degree of emphasis placed on each dimension of acoustic parameters according to the speech generation detection model, $y_{k,q}$ is the true vector of the q-th class acoustic parameter of the k-th frame, $\hat{y}_{k,q}$ is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame, and $\|\cdot\|_1$ denotes the norm of order 1, wherein the true vector of the q-th class acoustic parameter of the k-th frame is obtained by the following steps:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different types of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring the real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
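A rough sketch of how the frame-wise multi-dimensional acoustic parameter targets and the weighted L1 training loss described above might be computed is given below. The use of librosa for MFCC/LPC extraction, the frame sizes, the file path, and the weight values are assumptions introduced here; CQCC is omitted because librosa does not provide it directly, but it would be concatenated frame by frame in the same way.

```python
import numpy as np
import librosa
import torch

def true_multidim_params(wav_path: str, n_mfcc: int = 20, lpc_order: int = 12,
                         frame_length: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Extract two kinds of acoustic parameters from the real speech (MFCC + per-frame LPC)
    and splice them frame by frame into multi-dimensional acoustic parameter vectors."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)      # [n_mfcc, T]
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    n_frames = min(mfcc.shape[1], frames.shape[1])
    lpc = np.stack([librosa.lpc(frames[:, t], order=lpc_order)[1:]              # drop leading 1.0
                    for t in range(n_frames)], axis=1)                          # [lpc_order, T]
    return np.concatenate([mfcc[:, :n_frames], lpc], axis=0).T                  # [T, n_mfcc+lpc_order]

def weighted_l1_loss(pred: torch.Tensor, target: torch.Tensor,
                     dims: dict, weights: dict) -> torch.Tensor:
    """L = sum_k sum_q w_q * || y_{k,q} - y_hat_{k,q} ||_1 over frames k and parameter classes q.
    `dims` maps each parameter class q to its column slice; `weights` holds the w_q."""
    loss = pred.new_zeros(())
    for q, sl in dims.items():
        loss = loss + weights[q] * torch.abs(target[:, sl] - pred[:, sl]).sum()
    return loss

# Usage sketch (path, slices and weights are illustrative only):
# Y_true = torch.from_numpy(true_multidim_params("real_utterance.wav")).float()
# Y_pred = torch.randn_like(Y_true)
# loss = weighted_l1_loss(Y_pred, Y_true,
#                         dims={"mfcc": slice(0, 20), "lpc": slice(20, 32)},
#                         weights={"mfcc": 1.0, "lpc": 0.5})
```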
Unlike the conventional approach of adding a globally, relatively uniform countermeasure perturbation to an entire generated sentence, the voice countermeasure sample generation method of the present disclosure adds phoneme-related countermeasure information during speech generation according to the different phonemes being generated, so that voice countermeasure samples are generated in a more targeted manner. In the process of generating the countermeasure sample, the acoustic parameters commonly used by speech generation detection systems, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC), Linear Predictive Coefficients (LPC) and Constant-Q Cepstral Coefficients (CQCC), are comprehensively reconstructed, so that the generated speech is closer to real speech in parameter distribution; this constitutes an active attack means. The multi-dimensional parameter reconstruction method can deceive various speech generation detection models, and the generated countermeasure samples cannot be correctly identified by a speech generation detection system. Compared with the traditional method for generating countermeasure samples, the method is based on the mechanism of speech generation detection, adopts a phoneme-level fine-grained means to generate phoneme-related voice countermeasure samples for different generated speech contents, and uses the multi-dimensional parameter reconstruction method to improve the anti-detection capability of the voice countermeasure samples.
The voice countermeasure sample generation method of the present disclosure is based on the speech generation detection mechanism, improves the generation effect of voice countermeasure samples in an interpretable way, can deceive speech generation detection models more effectively, and is easy to operate and implement.
Referring to fig. 3, an embodiment of the present disclosure provides a voice confrontation sample generation apparatus, including:
the extraction module 11 is configured to receive a target text and extract a text feature sequence from the target text;
the input module 12 is configured to input the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and a generating module 13, configured to input the multidimensional acoustic parameter sequence into a pre-trained vocoder model, and generate a time-domain sampling sequence of speech as a countermeasure sample corresponding to the target text.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
In this embodiment, any plurality of the extraction module 11, the input module 12, and the generation module 13 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. At least one of the extraction module 11, the input module 12 and the generation module 13 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware and firmware, or in a suitable combination of any of them. Alternatively, at least one of the extraction module 11, the input module 12 and the generation module 13 may be at least partly implemented as a computer program module, which when executed may perform a corresponding function.
Referring to fig. 4, an electronic device provided by an embodiment of the present disclosure includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110, when executing the program stored in the memory 1130, implements the method for generating a speech countermeasure sample as follows:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
The communication bus 1140 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1130 may also be at least one storage device located remotely from the processor 1110.
The processor 1110 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for generating a speech countermeasure sample as described above.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The above-mentioned computer-readable storage medium carries one or more programs which, when executed, implement the speech countermeasure sample generation method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for generating a speech confrontation sample, comprising the steps of:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
2. The method of claim 1, wherein the acoustic model comprises a backbone network, an attention mechanism layer and a full connection layer, and the inputting the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence comprises:
inputting the text characteristic sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
inputting the intermediate multidimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a full connection layer to obtain a multidimensional acoustic parameter sequence.
3. The method of claim 2, wherein the inputting the sequence of intermediate multidimensional acoustic parameters into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix comprises:
multiplying each vector $h_i$ of the intermediate multidimensional acoustic parameter sequence $H = \{h_1, h_2, \ldots, h_N\}$ by three weight coefficients $W^{Q}$, $W^{K}$ and $W^{V}$ respectively to obtain three vectors:
$q_i = W^{Q} h_i,\quad k_i = W^{K} h_i,\quad v_i = W^{V} h_i;$
forming a matrix $Q$ from the vectors $q_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $K$ from the vectors $k_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence;
forming a matrix $V$ from the vectors $v_i$ corresponding to the vectors $h_i$ in the intermediate multidimensional acoustic parameter sequence, as the intermediate multidimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix $Q$ and the matrix $K$:
$a_{i,j} = q_i \cdot k_j$
wherein $a_{i,j}$ is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence, $q_i$ is the vector obtained by multiplying the i-th vector $h_i$ by $W^{Q}$, and $k_j$ is the vector obtained by multiplying the j-th vector $h_j$ by $W^{K}$;
and normalizing the matrix $A$ formed by the elements $a_{i,j}$ to obtain a matrix $\hat{A}$ as the vector correlation matrix.
4. The method of claim 3, wherein inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a fully connected layer to obtain a multidimensional acoustic parameter sequence comprises:
$Y = \mathrm{FCN}(\hat{A} V)$
wherein $Y$ is the multidimensional acoustic parameter sequence, the matrix $V$ is the intermediate multidimensional acoustic parameter matrix, the matrix $\hat{A}$ is the vector correlation matrix, and $\mathrm{FCN}$ denotes the fully connected layer.
5. The method of claim 1, wherein the vocoder model is trained by:
and taking the multi-dimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model to obtain a vocoder model.
6. The method of claim 1, wherein the acoustic model is trained by the following expression:
$L = \sum_{k}\sum_{q} w_q \left\| y_{k,q} - \hat{y}_{k,q} \right\|_1$
wherein $L$ is the acoustic model training loss function, $w_q$ is the weight coefficient, $y_{k,q}$ is the true vector of the q-th class acoustic parameter of the k-th frame, and $\hat{y}_{k,q}$ is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame.
7. The method according to claim 6, wherein the true vector of the q-th class acoustic parameter of the k-th frame is obtained by:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different types of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring the real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
8. A speech confrontation sample generation apparatus, comprising:
the extraction module is used for receiving a target text and extracting a text feature sequence from the target text;
the input module is used for inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and the generating module is used for inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a confrontation sample corresponding to the target text.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of generating a speech countermeasure sample according to any one of claims 1 to 7 when executing a program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the speech countermeasure sample generation method of any one of claims 1 to 7.
CN202210201797.6A 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium Active CN114267363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210201797.6A CN114267363B (en) 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210201797.6A CN114267363B (en) 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114267363A true CN114267363A (en) 2022-04-01
CN114267363B CN114267363B (en) 2022-05-24

Family

ID=80833816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210201797.6A Active CN114267363B (en) 2022-03-03 2022-03-03 Voice countercheck sample generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114267363B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859736A (en) * 2019-01-23 2019-06-07 北京光年无限科技有限公司 Phoneme synthesizing method and system
EP3599606A1 (en) * 2018-07-26 2020-01-29 Accenture Global Solutions Limited Machine learning for authenticating voice
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112786011A (en) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN113205792A (en) * 2021-04-08 2021-08-03 内蒙古工业大学 Mongolian speech synthesis method based on Transformer and WaveNet
CN114121010A (en) * 2021-11-30 2022-03-01 阿里巴巴(中国)有限公司 Model training, voice generation, voice interaction method, device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3599606A1 (en) * 2018-07-26 2020-01-29 Accenture Global Solutions Limited Machine learning for authenticating voice
CN109859736A (en) * 2019-01-23 2019-06-07 北京光年无限科技有限公司 Phoneme synthesizing method and system
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112786011A (en) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN113205792A (en) * 2021-04-08 2021-08-03 内蒙古工业大学 Mongolian speech synthesis method based on Transformer and WaveNet
CN114121010A (en) * 2021-11-30 2022-03-01 阿里巴巴(中国)有限公司 Model training, voice generation, voice interaction method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAIHAN LI et al.: "Neural speech synthesis with transformer network", Proceedings of the AAAI Conference on Artificial Intelligence *

Also Published As

Publication number Publication date
CN114267363B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
Balamurali et al. Toward robust audio spoofing detection: A detailed comparison of traditional and learned features
Patton et al. AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN111916111A (en) Intelligent voice outbound method and device with emotion, server and storage medium
JP2019215500A (en) Voice conversion learning device, voice conversion device, method, and program
Chatterjee et al. Auditory model-based design and optimization of feature vectors for automatic speech recognition
CN111968652A (en) Speaker identification method based on 3DCNN-LSTM and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
CN110648655A (en) Voice recognition method, device, system and storage medium
Radha et al. Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review
CN114267363B (en) Voice countercheck sample generation method and device, electronic equipment and storage medium
Praveen et al. Text dependent speaker recognition using MFCC features and BPANN
Rao Accent classification from an emotional speech in clean and noisy environments
Nijhawan et al. Speaker recognition using support vector machine
Bawa et al. Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions
Bakshi et al. Spoken Indian language classification using GMM supervectors and artificial neural networks
Bhaskar et al. Analysis of language identification performance based on gender and hierarchial grouping approaches
Ehkan et al. Hardware implementation of MFCC-based feature extraction for speaker recognition
Harere et al. Mispronunciation detection of basic quranic recitation rules using deep learning
Nijhawan et al. Real time speaker recognition system for hindi words
Srinivas LFBNN: robust and hybrid training algorithm to neural network for hybrid features-enabled speaker recognition system
CN116705036B (en) Multi-level feature fusion-based phrase voice speaker recognition method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant