CN114267363A - Voice adversarial sample generation method and device, electronic equipment and storage medium - Google Patents
Voice adversarial sample generation method and device, electronic equipment and storage medium
- Publication number
- CN114267363A CN114267363A CN202210201797.6A CN202210201797A CN114267363A CN 114267363 A CN114267363 A CN 114267363A CN 202210201797 A CN202210201797 A CN 202210201797A CN 114267363 A CN114267363 A CN 114267363A
- Authority
- CN
- China
- Prior art keywords
- acoustic parameter
- sequence
- matrix
- vector
- multidimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The present disclosure relates to a method and an apparatus for generating a voice adversarial sample, an electronic device and a storage medium. The method comprises: receiving a target text and extracting a text feature sequence from the target text; inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence; and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the adversarial sample corresponding to the target text. Because the output of the acoustic model is a multi-dimensional acoustic parameter sequence, the generated speech maintains high similarity (matching degree) to real speech across multiple acoustic feature dimensions.
Description
Technical Field
The present disclosure relates to the field of voice technologies, and in particular, to a method and an apparatus for generating a voice adversarial sample, an electronic device, and a storage medium.
Background
At present, in order to capture more discriminative information, speech generation detection models apply multiple acoustic features to speech signal processing; these features are fed directly into the model or used as decision criteria. When generating a voice adversarial sample, however, a speech synthesis model usually selects only one acoustic feature for acoustic modeling and reconstructs that parameter into a speech waveform with a vocoder. When the acoustic parameter adopted by the synthesis model is inconsistent with the acoustic parameters used by the detection model, the detection features of the generated speech differ greatly from those of real speech, so the generated speech is very easily detected and the speech generation detection system cannot be deceived.
In addition, in the prior art, voice adversarial samples are generated mainly by adding random perturbations within an error threshold, clamping the error, and the like, which amounts to passively adding adversarial noise. Although such methods can deceive a speech generation detection model to a certain extent, the added noise easily degrades the audible quality of the generated speech, making it easy to recognize and detect from a subjective, human perspective. Moreover, because these methods are not based on the mechanism of speech generation detection, the adversarial samples they produce are too limited and can effectively deceive only some given detection models.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, embodiments of the present disclosure provide a method and an apparatus for generating a voice adversarial sample, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for generating a voice adversarial sample, including the following steps:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
In a possible implementation, the acoustic model includes a backbone network, a self-attention mechanism layer, and a fully connected layer, and inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence includes:
inputting the text characteristic sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
inputting the intermediate multidimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and inputting the vector correlation matrix and the intermediate multi-dimensional acoustic parameter matrix into a fully connected layer to obtain a multi-dimensional acoustic parameter sequence.
In a possible implementation, inputting the intermediate multi-dimensional acoustic parameter sequence into the self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multi-dimensional acoustic parameter matrix includes:
multiplying each vector a_i of the intermediate multi-dimensional acoustic parameter sequence by three weight matrices W^Q, W^K and W^V to obtain three vectors:
q_i = W^Q a_i, k_i = W^K a_i, v_i = W^V a_i;
forming a matrix Q from the vectors q_i corresponding to the vectors in the intermediate multi-dimensional acoustic parameter sequence;
forming a matrix K from the corresponding vectors k_i;
forming a matrix V from the corresponding vectors v_i, as the intermediate multi-dimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix Q and the matrix K:
α_{i,j} = q_i · k_j / sqrt(d),
where α_{i,j} is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence; q_i is the vector obtained by multiplying the i-th vector a_i by W^Q; k_j is the vector obtained by multiplying the j-th vector a_j by W^K; and d is the vector dimension. The correlations α_{i,j}, normalized by a softmax, form the vector correlation matrix A′.
In a possible embodiment, inputting the vector correlation matrix and the intermediate multi-dimensional acoustic parameter matrix into the fully connected layer to obtain the multi-dimensional acoustic parameter sequence includes calculating:
Y = FCN(A′V),
where Y is the multi-dimensional acoustic parameter sequence, the matrix V is the intermediate multi-dimensional acoustic parameter matrix, A′ is the vector correlation matrix, and FCN is the fully connected layer.
In one possible embodiment, the vocoder model is trained by:
and taking the multi-dimensional acoustic parameter sequence Y as input and the time domain sampling sequence as output to train a neural network model to obtain a vocoder model.
In one possible implementation, the acoustic model is trained with the following loss:
L = Σ_k Σ_q w_q · ||y_{k,q} − ŷ_{k,q}||_1,
where L is the acoustic model training loss function, w_q is the weight coefficient, y_{k,q} is the true vector of the q-th class acoustic parameter of the k-th frame, and ŷ_{k,q} is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame.
In a possible implementation manner, the true vector of the q-th acoustic parameter of the k-th frame is obtained by the following steps:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear predictive coefficients and constant-Q cepstral coefficients;
splicing the different classes of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring a real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
In a second aspect, an embodiment of the present disclosure provides a voice adversarial sample generation apparatus, including:
the extraction module is used for receiving a target text and extracting a text feature sequence from the target text;
the input module is used for inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and the generating module is used for inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the adversarial sample corresponding to the target text.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the voice countermeasure sample generation method when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the above-mentioned voice countermeasure sample generation method.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure at least has part or all of the following advantages:
the voice countermeasure sample generation method of the embodiment of the disclosure receives a target text and extracts a text feature sequence from the target text; inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence; the multi-dimensional acoustic parameter sequence is input into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to a target text, and the output of the acoustic model is the multi-dimensional acoustic parameter sequence, so that the generated voice content can ensure high similarity (matching degree) under the description of various acoustic feature dimensions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly introduced below; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 schematically illustrates a flow diagram of a method for generating a voice adversarial sample according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a method for generating a voice adversarial sample according to another embodiment of the present disclosure;
fig. 3 schematically shows a block diagram of the structure of a voice adversarial sample generation apparatus according to an embodiment of the present disclosure; and
fig. 4 schematically shows a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a method for generating a voice adversarial sample, including the following steps:
s1, receiving a target text and extracting a text feature sequence from the target text;
in practical application, for the target text for which an adversarial sample is to be generated, the text features used in conventional speech synthesis are obtained through text normalization, text-to-phoneme conversion, polyphone prediction, prosodic pause prediction, and so on. For each phoneme f_n, the text feature x_n contains a series of features helpful to acoustic modeling, such as phoneme information, tone information, part-of-speech information and prosodic pause information. Assuming that a sample text of the training corpus contains N phoneme units, the processed and quantized text feature sequence is defined as X = (x_1, x_2, …, x_N).
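As a toy illustration of what such a quantized text feature sequence might look like — the phoneme inventory, feature fields and integer codes below are hypothetical stand-ins, not taken from the disclosure:

```python
import numpy as np

# Hypothetical sketch: each phoneme unit is described by a small feature record
# (phoneme id, tone, part-of-speech, prosodic pause level) and quantized into
# one numeric vector x_n; the sentence becomes the sequence X = (x_1, ..., x_N).
PHONEMES = {"sil": 0, "n": 1, "i": 2, "h": 3, "ao": 4}  # toy inventory
POS_TAGS = {"none": 0, "noun": 1, "verb": 2}

def quantize_phoneme(phoneme, tone, pos, pause_level):
    """Map one phoneme's linguistic features to a numeric feature vector."""
    return np.array([PHONEMES[phoneme], tone, POS_TAGS[pos], pause_level],
                    dtype=np.float32)

# "ni hao" -> N = 4 phoneme units -> text feature sequence X of shape (4, 4)
records = [("n", 3, "none", 0), ("i", 3, "none", 0),
           ("h", 3, "none", 0), ("ao", 3, "none", 1)]
X = np.stack([quantize_phoneme(*r) for r in records])
print(X.shape)  # (4, 4)
```

In a real front end the record would come from the text-normalization and prosody-prediction modules mentioned above rather than being written by hand.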
S2, inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and S3, inputting the multidimensional acoustic parameter sequence into a pre-trained vocoder model, and generating a time domain sampling sequence (time domain voice waveform) of voice as a countermeasure sample corresponding to the target text, wherein the structure of the vocoder model can adopt a deep neural network, and is similar to the WaveRNN structure.
Referring to fig. 2, in step S2, the acoustic model includes a backbone network, a self-attention mechanism layer, and a fully connected layer, and inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence includes:
s21, inputting the text feature sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
in practical application, the backbone network may adopt a neural network structure of an encoder-decoder, wherein the encoder-decoder structure may apply various forms of deep neural networks such as LSTM, CNN, and the like.
S22, inputting the intermediate multi-dimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multi-dimensional acoustic parameter matrix;
and S23, inputting the vector correlation matrix and the intermediate multi-dimensional acoustic parameter matrix into the fully connected layer to obtain a multi-dimensional acoustic parameter sequence.
In this embodiment, in step S22, inputting the intermediate multi-dimensional acoustic parameter sequence into the self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multi-dimensional acoustic parameter matrix includes:
multiplying each vector a_i of the intermediate multi-dimensional acoustic parameter sequence by three weight matrices W^Q, W^K and W^V to obtain three vectors:
q_i = W^Q a_i, k_i = W^K a_i, v_i = W^V a_i;
forming a matrix Q from the vectors q_i corresponding to the vectors in the intermediate multi-dimensional acoustic parameter sequence;
forming a matrix K from the corresponding vectors k_i;
forming a matrix V from the corresponding vectors v_i, as the intermediate multi-dimensional acoustic parameter matrix;
and calculating the correlation between every two vectors in the intermediate multi-dimensional acoustic parameter sequence according to the matrix Q and the matrix K:
α_{i,j} = q_i · k_j / sqrt(d),
where α_{i,j} is the correlation between the i-th vector and the j-th vector in the intermediate multi-dimensional acoustic parameter sequence; q_i is the vector obtained by multiplying the i-th vector a_i by W^Q; k_j is the vector obtained by multiplying the j-th vector a_j by W^K; and d is the vector dimension. The correlations α_{i,j}, normalized by a softmax, form the vector correlation matrix A′.
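The self-attention computation of step S22 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the patent's implementation: the matrix sizes, the sqrt(d) scaling and the softmax normalization of the correlation matrix are standard assumptions about how such a layer is usually built.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention_layer(A, d_model):
    """Project each row a_i of the intermediate multi-dimensional acoustic
    parameter sequence A by three weight matrices into q_i, k_i, v_i; the
    softmax-normalized pairwise correlations of Q and K form the vector
    correlation matrix, and V is the intermediate multi-dimensional acoustic
    parameter matrix."""
    Wq = rng.standard_normal((A.shape[1], d_model)) / np.sqrt(A.shape[1])
    Wk = rng.standard_normal((A.shape[1], d_model)) / np.sqrt(A.shape[1])
    Wv = rng.standard_normal((A.shape[1], d_model)) / np.sqrt(A.shape[1])
    Q, K, V = A @ Wq, A @ Wk, A @ Wv
    scores = Q @ K.T / np.sqrt(d_model)          # alpha_{i,j} = q_i . k_j / sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    corr = np.exp(scores)
    corr /= corr.sum(axis=1, keepdims=True)      # vector correlation matrix A'
    return corr, V

A = rng.standard_normal((6, 16))     # 6 intermediate vectors, 16 dims each
corr, V = self_attention_layer(A, d_model=8)
Y = corr @ V                         # then fed through the fully connected layer
print(corr.shape, V.shape, Y.shape)  # (6, 6) (6, 8) (6, 8)
```

Each row of `corr` sums to 1, so row i of `corr @ V` is a correlation-weighted mixture of all intermediate vectors, which is exactly what the fully connected layer of step S23 then consumes.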
In this embodiment, in step S23, inputting the vector correlation matrix and the intermediate multi-dimensional acoustic parameter matrix into the fully connected layer to obtain the multi-dimensional acoustic parameter sequence includes calculating:
Y = FCN(A′V),
where Y is the multi-dimensional acoustic parameter sequence, the matrix V is the intermediate multi-dimensional acoustic parameter matrix, A′ is the vector correlation matrix, and FCN is the fully connected layer.
In this embodiment, in step S3, the vocoder model is obtained by the following training steps:
taking the multi-dimensional acoustic parameter sequence Y as input and the time-domain sampling sequence as output to train a neural network model, obtaining the vocoder model, wherein the acoustic model is trained with the following loss:
L = Σ_k Σ_q w_q · ||y_{k,q} − ŷ_{k,q}||_1,
where L is the acoustic model training loss function; w_q is a weight coefficient that adjusts, according to the speech generation detection model, how much emphasis the generation places on each class of acoustic parameters; y_{k,q} is the true vector of the q-th class acoustic parameter of the k-th frame; ŷ_{k,q} is the vector actually predicted by the model for the q-th class acoustic parameter of the k-th frame; and ||·||_1 denotes the 1-norm. The true vector of the q-th class acoustic parameter of the k-th frame is obtained by the following steps:
extracting acoustic parameters from the real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear predictive coefficients and constant-Q cepstral coefficients;
splicing the different classes of acoustic parameters frame by frame to obtain the k-th frame real multi-dimensional acoustic parameter sequence;
and acquiring a real vector of the q-th class acoustic parameter from the k-th frame real multi-dimensional acoustic parameter sequence.
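The frame-wise splicing of parameter classes and the weighted 1-norm loss described above can be sketched as follows. The feature widths (13/12/20) and class weights are hypothetical stand-ins, and random arrays replace real MFCC/LPC/CQCC extraction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame acoustic parameters "extracted" from the real speech
# (random stand-ins for MFCC / LPC / CQCC; a real system would run a feature
# extractor on the recording of the target text).
K_FRAMES = 4
mfcc = rng.standard_normal((K_FRAMES, 13))
lpc  = rng.standard_normal((K_FRAMES, 12))
cqcc = rng.standard_normal((K_FRAMES, 20))

# Splice the different parameter classes frame by frame: row k is the k-th
# frame's real multi-dimensional acoustic parameter vector.
target = np.concatenate([mfcc, lpc, cqcc], axis=1)     # shape (4, 45)

def acoustic_loss(pred, true, widths, weights):
    """Sum over frames k and parameter classes q of w_q * ||y_kq - yhat_kq||_1,
    where each class q occupies a contiguous slice of the spliced vector."""
    loss, start = 0.0, 0
    for w, width in zip(weights, widths):
        sl = slice(start, start + width)
        loss += w * np.abs(true[:, sl] - pred[:, sl]).sum()
        start += width
    return loss

pred = target + 0.1                                    # toy constant-offset prediction
loss = acoustic_loss(pred, target, widths=[13, 12, 20], weights=[1.0, 0.5, 2.0])
print(float(loss))
```

With a uniform offset of 0.1, each class contributes w_q · 0.1 · K · width, so raising a class weight (here CQCC at 2.0) makes the training emphasize matching the parameters the targeted detection model relies on.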
Different from the conventional approach of adding a globally, relatively uniform adversarial perturbation to a whole generated utterance, the disclosed method adds phoneme-dependent adversarial perturbations during generation according to the phonemes being generated, so voice adversarial samples are produced in a more targeted way. In the generation process, it comprehensively reconstructs the acoustic parameters commonly used by speech generation detection systems, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC), Linear Predictive Coefficients (LPC) and Constant-Q Cepstral Coefficients (CQCC), so that the generated speech is closer to real speech in parameter distribution. This is an active attack means: the multi-dimensional parameter reconstruction can deceive a variety of speech generation detection models, and the generated adversarial samples cannot be correctly identified by a speech generation detection system. Compared with traditional adversarial sample generation, the method is based on the mechanism of speech generation detection, uses phoneme-level granularity to generate phoneme-dependent voice adversarial samples for different speech contents, and uses multi-dimensional parameter reconstruction to improve the anti-detection capability of the voice adversarial samples.
The voice adversarial sample generation method is based on the speech generation detection mechanism, can explainably improve the quality of the generated voice adversarial samples, can deceive speech generation detection models more effectively, and is easy to operate and implement.
Referring to fig. 3, an embodiment of the present disclosure provides a voice adversarial sample generation apparatus, including:
the extraction module 11 is configured to receive a target text and extract a text feature sequence from the target text;
the input module 12 is configured to input the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and a generating module 13, configured to input the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the adversarial sample corresponding to the target text.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
In this embodiment, any plurality of the extraction module 11, the input module 12, and the generation module 13 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. At least one of the extraction module 11, the input module 12 and the generation module 13 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of three implementations of software, hardware and firmware, or in a suitable combination of any of them. Alternatively, at least one of the extraction module 11, the input module 12 and the generation module 13 may be at least partly implemented as a computer program module, which when executed may perform a corresponding function.
Referring to fig. 4, an electronic device provided by an embodiment of the present disclosure includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, where the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110, when executing the program stored in the memory 1130, implements the method for generating a speech countermeasure sample as follows:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
The communication bus 1140 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The Memory 1130 may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory 1130 may also be at least one memory device located remotely from the processor 1110.
The Processor 1110 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for generating a speech countermeasure sample as described above.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The above-mentioned computer-readable storage medium carries one or more programs which, when executed, implement the speech countermeasure sample generation method according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for generating a voice adversarial sample, comprising the steps of:
receiving a target text and extracting a text feature sequence from the target text;
inputting the text characteristic sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence;
and inputting the multi-dimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time domain sampling sequence of voice as a countermeasure sample corresponding to the target text.
2. The method of claim 1, wherein the acoustic model comprises a backbone network, a self-attention mechanism layer and a fully connected layer, and inputting the text feature sequence into a pre-trained acoustic model to obtain a multi-dimensional acoustic parameter sequence comprises:
inputting the text characteristic sequence into a backbone network to obtain an intermediate multidimensional acoustic parameter sequence;
inputting the intermediate multidimensional acoustic parameter sequence into a self-attention mechanism layer to obtain a vector correlation matrix and an intermediate multidimensional acoustic parameter matrix;
and inputting the vector correlation matrix and the intermediate multi-dimensional acoustic parameter matrix into a fully connected layer to obtain a multi-dimensional acoustic parameter sequence.
3. The method of claim 2, wherein inputting the intermediate multidimensional acoustic parameter sequence into the self-attention mechanism layer to obtain the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix comprises:
multiplying each vector a_i in the intermediate multidimensional acoustic parameter sequence by three weight matrices W_Q, W_K and W_V respectively to obtain three vectors q_i, k_i and v_i;
forming a matrix Q from the vectors q_i corresponding to the vectors in the intermediate multidimensional acoustic parameter sequence;
forming a matrix K from the vectors k_i corresponding to the vectors in the intermediate multidimensional acoustic parameter sequence;
forming a matrix V, used as the intermediate multidimensional acoustic parameter matrix, from the vectors v_i corresponding to the vectors in the intermediate multidimensional acoustic parameter sequence;
and calculating the correlation between every two vectors in the intermediate multidimensional acoustic parameter sequence according to the matrix Q and the matrix K:

α(i, j) = q_i · k_j

where α(i, j) is the correlation between the ith vector and the jth vector in the intermediate multidimensional acoustic parameter sequence; q_i is the vector obtained by multiplying the ith vector in the intermediate multidimensional acoustic parameter sequence by W_Q; and k_j is the vector obtained by multiplying the jth vector in the intermediate multidimensional acoustic parameter sequence by W_K.
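The Q/K/V construction and pairwise correlation of claim 3 match standard self-attention. A minimal NumPy sketch, with random stand-ins for the learned weight matrices and the plain dot product q_i · k_j as the correlation (without the 1/√d scaling that some implementations add):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                      # sequence length, acoustic parameter dimension
A = rng.normal(size=(T, d))      # intermediate multidimensional acoustic parameter sequence

# Three weight matrices (learned in practice; random here for illustration)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q = A @ Wq                       # matrix Q: row i is q_i
K = A @ Wk                       # matrix K: row i is k_i
V = A @ Wv                       # matrix V: the intermediate acoustic parameter matrix

# Vector correlation matrix: alpha[i, j] = q_i . k_j
alpha = Q @ K.T
```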
4. The method of claim 3, wherein inputting the vector correlation matrix and the intermediate multidimensional acoustic parameter matrix into a fully connected layer to obtain a multidimensional acoustic parameter sequence comprises:
5. The method of claim 1, wherein the vocoder model is trained by:
taking a multidimensional acoustic parameter sequence Y as input and a time-domain sampling sequence as output to train a neural network model, so as to obtain the vocoder model.
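Claim 5 trains a network to map the acoustic parameter sequence Y to time-domain samples. As a deliberately minimal stand-in, the sketch below fits a single linear layer by gradient descent on synthetic data; a real neural vocoder (e.g. WaveNet, cited in the reference list) is far more complex:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 16
Y = rng.normal(size=(T, d))        # multidimensional acoustic parameter sequence (input)
s = Y @ rng.normal(size=d)         # target time-domain samples (synthetic, for illustration)

# Single linear layer as a toy "neural network model", trained by gradient descent
w = np.zeros(d)
for _ in range(2000):
    grad = 2.0 * Y.T @ (Y @ w - s) / T   # gradient of the mean squared error
    w -= 0.01 * grad

mse = float(np.mean((Y @ w - s) ** 2))   # shrinks toward 0 as training converges
```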
6. The method of claim 1, wherein the acoustic model is trained with the following loss function:
where L is the training loss function of the acoustic model, w_q is a weight coefficient, y_{k,q} is the true vector of the qth class acoustic parameter of the kth frame, and ŷ_{k,q} is the vector actually predicted by the model for the qth class acoustic parameter of the kth frame.
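The variables defined in claim 6 suggest a per-class weighted error summed over frames k and parameter classes q. The exact expression is not reproduced in the text, so the squared-error form below is an assumption used only for illustration:

```python
import numpy as np

def acoustic_loss(y_true, y_pred, weights):
    """Assumed loss: sum over frames k and classes q of w_q * ||y_kq - yhat_kq||^2.

    y_true, y_pred: arrays of shape (K frames, Q classes, dim);
    weights: per-class weight coefficients w_q.
    """
    K, Qc, _ = y_true.shape
    loss = 0.0
    for k in range(K):
        for q in range(Qc):
            loss += weights[q] * np.sum((y_true[k, q] - y_pred[k, q]) ** 2)
    return loss

y = np.ones((3, 2, 4))
y_hat = np.zeros_like(y)
w = np.array([1.0, 0.5])
# each class-q block contributes w_q * 4 per frame -> 3 * (1.0*4 + 0.5*4) = 18.0
total = acoustic_loss(y, y_hat, w)
```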
7. The method according to claim 6, wherein the true vector of the qth class acoustic parameter of the kth frame is obtained by:
extracting acoustic parameters from real speech corresponding to the target text, wherein the acoustic parameters comprise at least two of Mel-frequency cepstral coefficients, linear prediction coefficients and constant-Q transform cepstral coefficients;
splicing the different classes of acoustic parameters frame by frame to obtain the kth-frame real multidimensional acoustic parameter sequence;
and acquiring the true vector of the qth class acoustic parameter from the kth-frame real multidimensional acoustic parameter sequence.
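Claim 7's frame-wise splicing of heterogeneous acoustic parameters (e.g. MFCC, LPC, CQCC) amounts to column-wise concatenation of per-frame feature matrices. The shapes below are hypothetical; in practice each matrix would come from a real extractor (for instance librosa's MFCC routine) computed with the same frame step:

```python
import numpy as np

def splice_frame_features(feature_sets):
    """Concatenate different acoustic parameter classes frame by frame.

    feature_sets: list of arrays, each of shape (num_frames, dim_i),
    all computed with the same framing so rows align.
    """
    num_frames = feature_sets[0].shape[0]
    assert all(f.shape[0] == num_frames for f in feature_sets)
    return np.concatenate(feature_sets, axis=1)   # (num_frames, sum of dims)

# Hypothetical per-class feature matrices for a 10-frame utterance
mfcc = np.zeros((10, 13))
lpc = np.zeros((10, 12))
cqcc = np.zeros((10, 20))
multi = splice_frame_features([mfcc, lpc, cqcc])

# Row k of `multi` is the kth-frame real multidimensional acoustic parameter
# sequence; slicing its columns recovers the per-class true vectors.
mfcc_k = multi[3, :13]
```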
8. A speech adversarial sample generation apparatus, comprising:
an extraction module, configured to receive a target text and extract a text feature sequence from the target text;
an input module, configured to input the text feature sequence into a pre-trained acoustic model to obtain a multidimensional acoustic parameter sequence;
and a generation module, configured to input the multidimensional acoustic parameter sequence into a pre-trained vocoder model to generate a time-domain sampling sequence of speech as the adversarial sample corresponding to the target text.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the speech adversarial sample generation method of any one of claims 1 to 7 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech adversarial sample generation method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210201797.6A CN114267363B (en) | 2022-03-03 | 2022-03-03 | Voice countercheck sample generation method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210201797.6A CN114267363B (en) | 2022-03-03 | 2022-03-03 | Voice countercheck sample generation method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114267363A true CN114267363A (en) | 2022-04-01 |
CN114267363B CN114267363B (en) | 2022-05-24 |
Family
ID=80833816
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210201797.6A Active CN114267363B (en) | 2022-03-03 | 2022-03-03 | Voice countercheck sample generation method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114267363B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109859736A (en) * | 2019-01-23 | 2019-06-07 | 北京光年无限科技有限公司 | Phoneme synthesizing method and system |
EP3599606A1 (en) * | 2018-07-26 | 2020-01-29 | Accenture Global Solutions Limited | Machine learning for authenticating voice |
CN111754976A (en) * | 2020-07-21 | 2020-10-09 | 中国科学院声学研究所 | Rhythm control voice synthesis method, system and electronic device |
CN112786011A (en) * | 2021-01-13 | 2021-05-11 | 北京有竹居网络技术有限公司 | Speech synthesis method, synthesis model training method, apparatus, medium, and device |
CN113205792A (en) * | 2021-04-08 | 2021-08-03 | 内蒙古工业大学 | Mongolian speech synthesis method based on Transformer and WaveNet |
CN114121010A (en) * | 2021-11-30 | 2022-03-01 | 阿里巴巴(中国)有限公司 | Model training, voice generation, voice interaction method, device and storage medium |
- 2022-03-03 CN CN202210201797.6A patent/CN114267363B/en active Active
Non-Patent Citations (1)
Title |
---|
NAIHAN LI 等: "Neural speech synthesis with transformer network", 《PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114267363B (en) | 2022-05-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10176811B2 (en) | Neural network-based voiceprint information extraction method and apparatus | |
Balamurali et al. | Toward robust audio spoofing detection: A detailed comparison of traditional and learned features | |
Patton et al. | AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech | |
US11450332B2 (en) | Audio conversion learning device, audio conversion device, method, and program | |
CN104903954A (en) | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination | |
CN111916111A (en) | Intelligent voice outbound method and device with emotion, server and storage medium | |
JP2019215500A (en) | Voice conversion learning device, voice conversion device, method, and program | |
Chatterjee et al. | Auditory model-based design and optimization of feature vectors for automatic speech recognition | |
CN111968652A (en) | Speaker identification method based on 3DCNN-LSTM and storage medium | |
CN113327575B (en) | Speech synthesis method, device, computer equipment and storage medium | |
Rudresh et al. | Performance analysis of speech digit recognition using cepstrum and vector quantization | |
CN110648655A (en) | Voice recognition method, device, system and storage medium | |
Radha et al. | Speech and speaker recognition using raw waveform modeling for adult and children’s speech: a comprehensive review | |
CN114267363B (en) | Voice countercheck sample generation method and device, electronic equipment and storage medium | |
Praveen et al. | Text dependent speaker recognition using MFCC features and BPANN | |
Rao | Accent classification from an emotional speech in clean and noisy environments | |
Nijhawan et al. | Speaker recognition using support vector machine | |
Bawa et al. | Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions | |
Bakshi et al. | Spoken Indian language classification using GMM supervectors and artificial neural networks | |
Bhaskar et al. | Analysis of language identification performance based on gender and hierarchial grouping approaches | |
Ehkan et al. | Hardware implementation of MFCC-based feature extraction for speaker recognition | |
Harere et al. | Mispronunciation detection of basic quranic recitation rules using deep learning | |
Nijhawan et al. | Real time speaker recognition system for hindi words | |
Srinivas | LFBNN: robust and hybrid training algorithm to neural network for hybrid features-enabled speaker recognition system | |
CN116705036B (en) | Multi-level feature fusion-based phrase voice speaker recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||