CN112562686B - Zero-sample voice conversion corpus preprocessing method using neural network

Info

Publication number
CN112562686B
CN112562686B
Authority
CN
China
Prior art keywords
speaker
generator
identity
output
vector
Prior art date
Legal status
Expired - Fee Related
Application number
CN202011433778.3A
Other languages
Chinese (zh)
Other versions
CN112562686A (en)
Inventor
Wei Jianguo (魏建国)
Geng Taijia (更太加)
Current Assignee
Qinghai Nationalities University
Original Assignee
Qinghai Nationalities University
Priority date
Filing date
Publication date
Application filed by Qinghai Nationalities University filed Critical Qinghai Nationalities University
Priority to CN202011433778.3A
Publication of CN112562686A
Application granted
Publication of CN112562686B


Classifications

    • G10L 15/26 Speech to text systems
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G10L 17/04 Training, enrolment or model building
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a zero-sample voice conversion corpus preprocessing method using a neural network, which improves the effectiveness of speaker identity coding vectors in the new field of zero-sample voice conversion and improves the quality of the converted speech to a certain extent. The speaker identity codes are preprocessed by a neural network: a speaker identity encoder extracts the speaker identity coding vectors from the corpus, the extracted vectors are input into a generator together with content codes extracted by a content encoder, and the generator produces adjusted speaker identity coding vectors. The result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.

Description

Zero-sample voice conversion corpus preprocessing method using neural network
Technical Field
The application relates to the technical field of speech processing, in particular to a zero-sample voice conversion corpus preprocessing method using a neural network.
Background
The most central application of voice conversion technology is to change the timbre of speech so that it sounds as if the target speaker is speaking.
In recent years, the field of voice conversion has developed rapidly, moving from parallel systems that require parallel corpora and manual alignment to non-parallel systems that require neither. The advantage of a non-parallel system is that its demands on the training corpus are low: training is flexible and data are easy to obtain, which has broadened the application range of voice conversion technology.
However, conventional non-parallel systems can only realize voice conversion under one-to-one, one-to-many, many-to-one, or many-to-many conditions; that is, the source and target speakers in the conversion task must be speakers in the training set. For a speaker not in the training set, using that speaker as the source or target requires retraining the model with the corresponding voice data. The neural network models used for voice conversion are generally complex, so retraining inevitably costs a great deal of time and effort, and the training parameters must often be tuned repeatedly before conversion works properly. Zero-sample voice conversion has therefore become a new research direction in the voice conversion field in recent years.
Zero sample means that the source or target speaker in the conversion task need not be in the training set but can be any speaker; that is, zero-sample voice conversion allows a single conversion model to perform voice conversion between arbitrary speakers, breaking the limitation to speakers in the training set.
One core idea of zero-sample voice conversion is to use a speaker identity coding vector to represent the speaker's identity label, but the following problems arise:
1) when very little corpus is available for the source or target speaker, the resulting speaker identity coding vector is not necessarily reliable;
2) when speaker identity coding vectors are used for model training, an average must be computed so that a fixed vector represents a fixed speaker;
3) for speakers not in the training set, the identity coding vector used at conversion time does not match the voice conversion model well.
Therefore, a zero-sample speech conversion corpus preprocessing method using a neural network is urgently needed.
Disclosure of Invention
The invention aims to provide a zero-sample voice conversion corpus preprocessing method using a neural network, which improves the effectiveness of speaker identity coding vectors in the new field of zero-sample voice conversion and improves the quality of the converted speech to a certain extent.
In a first aspect, the present application provides a method for preprocessing zero-sample speech conversion corpus using a neural network, the method comprising:
a generator built from a neural network preprocesses the identity coding vectors of speakers not in the training set; a 256-dimensional vector represents the speaker's personalized characteristics such as timbre and corresponds to the speaker's identity label;
speaker-related information and speaker-independent information in the speech are separated by an encoder, the extracted speaker-related information being 32-dimensional or 64-dimensional;
the generator consists of 7 neural network layers: the first three layers are one-dimensional convolution layers with kernel size 5, each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512; the next three layers are a recurrent LSTM network, and after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768; the last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code;
the generator is trained independently of the voice conversion model, so that for an input speaker identity coding vector it outputs a result that is close in value but not fully identical; the closer the generator's output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better;
the speaker identity code is preprocessed based on the neural network: a speaker identity encoder extracts the speaker identity coding vector from the corpus, the extracted vector is input into the generator together with the content code extracted by the content encoder, the generator produces the adjusted speaker identity coding vector, and the result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
With reference to the first aspect, in a first possible implementation manner of the first aspect, a final training target of the generator is as follows:
S_nA = E_s(X_nA)

where X_nA represents the audio features of the original speech, E_s represents the speaker identity encoder, and S_nA represents the resulting initial speaker identity coding vector;

S'_A = G(S_nA)

where G represents the generator and S'_A represents the adjusted speaker identity coding vector, i.e. the corresponding output of the generator;

L_adjust = ‖G(S_nA) − S̄_A‖

where L_adjust represents the loss function used in training the generator: the gap between the generator's output and the mean identity coding vector S̄_A of the corresponding speaker used in the voice conversion model is minimized, and the generator is thereby trained.
With reference to the first aspect, in a second possible implementation manner of the first aspect, a generative adversarial network consists of a generator and a discriminator, which are continuously optimized and iterated against each other according to a given objective function, finally yielding the model.
In a second aspect, the present application provides a zero-sample speech conversion corpus preprocessing system using a neural network, the system comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method steps of any possible implementation of the first aspect according to instructions in the program code.
In a third aspect, the present application provides a computer-readable storage medium for storing program code for performing the method steps of any possible implementation of the first aspect.
In a fourth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method steps of any possible implementation of the first aspect.
The invention provides a zero-sample voice conversion corpus preprocessing method using a neural network, which improves the effectiveness of speaker identity coding vectors in the new field of zero-sample voice conversion and improves the quality of the converted speech to a certain extent. The speaker identity codes are preprocessed by a neural network: a speaker identity encoder extracts the speaker identity coding vectors from the corpus, the extracted vectors are input into a generator together with content codes extracted by a content encoder, and the generator produces adjusted speaker identity coding vectors. The result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments are briefly described below; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a speech conversion system of the present invention.
FIG. 2 is a diagram illustrating the preprocessing flow of the present invention.
FIG. 3 is a diagram of a generator network architecture design of the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is more clearly defined.
The application provides a zero-sample voice conversion corpus preprocessing method using a neural network, which comprises the following steps:
a generator built from a neural network preprocesses the identity coding vectors of speakers not in the training set; a 256-dimensional vector represents the speaker's personalized characteristics such as timbre and corresponds to the speaker's identity label;
speaker-related information and speaker-independent information in the speech are separated by an encoder, the extracted speaker-related information being 32-dimensional or 64-dimensional;
the generator consists of 7 neural network layers: the first three layers are one-dimensional convolution layers with kernel size 5, each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512; the next three layers are a recurrent LSTM network, and after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768; the last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code;
the generator is trained independently of the voice conversion model, so that for an input speaker identity coding vector it outputs a result that is close in value but not fully identical; the closer the generator's output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better;
the speaker identity code is preprocessed based on the neural network: a speaker identity encoder extracts the speaker identity coding vector from the corpus, the extracted vector is input into the generator together with the content code extracted by the content encoder, the generator produces the adjusted speaker identity coding vector, and the result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
The core of the method is the neural-network generator, which preprocesses the identity coding vectors of speakers not in the training set and thereby improves their effectiveness.
The generator is described as follows:
1) in current zero-sample voice conversion technology, a 256-dimensional vector is generally used to represent the speaker's personalized features such as timbre, and can be regarded as the identity label corresponding to that speaker;
2) in the currently most popular zero-sample voice conversion frameworks based on autoencoders, feature separation is the core processing step: speaker-related information and speaker-independent information in the speech are separated by an encoder, and the extracted speaker-related information is generally 32-dimensional or 64-dimensional. The method provided by the invention is generally applicable; only the 64-dimensional case is described as an example, and zero-sample voice conversion systems based on other dimensions or other frameworks are equally applicable;
3) the generator's purpose is to preprocess the identity coding vectors of speakers not in the training set; its input dimension is 256+64 and its output dimension is 256, so that the identity coding vector is processed and its usability improved;
4) the generator consists of 7 neural network layers. The first three layers are one-dimensional convolution layers with kernel size 5; each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512. The next three layers are a recurrent LSTM network; after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768. The last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code; a sketch of this architecture is given below;
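For illustration, a minimal PyTorch sketch of this 7-layer generator follows. It is one reading of the description above, not the patent's implementation: the text does not state how the 256-dimensional speaker vector is combined with the 64-dimensional per-frame content code, nor the convolution padding, so here the speaker vector is assumed to be broadcast along time and concatenated channel-wise (giving the stated 256+64 input dimension), with padding=2 assumed to preserve sequence length.

```python
import torch
import torch.nn as nn

class SpeakerEmbeddingGenerator(nn.Module):
    """Sketch of the 7-layer generator: 3x Conv1d(k=5) + BatchNorm + ReLU,
    a 3-layer LSTM (only the last time step is kept), and a final
    fully connected layer projecting back to 256 dimensions."""

    def __init__(self, spk_dim=256, content_dim=64, conv_dim=512, lstm_dim=768):
        super().__init__()
        blocks = []
        in_ch = spk_dim + content_dim                # assumed 256 + 64 = 320 input
        for _ in range(3):                           # three convolution blocks
            blocks += [
                nn.Conv1d(in_ch, conv_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(conv_dim),            # batch normalization after each conv
                nn.ReLU(),
            ]
            in_ch = conv_dim
        self.convs = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(conv_dim, lstm_dim, num_layers=3, batch_first=True)
        self.fc = nn.Linear(lstm_dim, spk_dim)       # limit output back to 256 dims

    def forward(self, spk_vec, content_code):
        # spk_vec: (B, 256); content_code: (B, T, 64)
        T = content_code.size(1)
        spk = spk_vec.unsqueeze(1).expand(-1, T, -1)        # broadcast along time
        x = torch.cat([spk, content_code], dim=-1)          # (B, T, 320)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (B, C, T)
        out, _ = self.lstm(x)                               # (B, T, 768)
        return self.fc(out[:, -1])                          # last time step -> (B, 256)
```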
5) the generator is trained independently of the voice conversion model. The main idea of the training is that, for an input speaker identity coding vector, the generator should output a result that is close in value but not fully identical; the closer its output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better.
6) Reliable, usable speaker identity coding vectors are essential for zero-sample voice conversion. Speaker identity coding technology originates in the field of speaker recognition, but the two fields use it differently and place different demands on it. In speaker recognition, the identity coding vector is mainly used to judge whether two speech segments belong to the same speaker, or to determine the identity of the speaker of a segment; the vector produced by the speaker encoder is therefore directly usable, a moderate deviation does not affect the final result, and practical requirements can be met. In voice conversion, however, the speaker identity coding vector serves as the speaker's identity label, so each speaker should ideally correspond to one fully determined, accurate identity coding vector; only then can a better voice conversion result be obtained.
In order to solve the above problems, there are the following ideas:
first, it is not practical to improve the performance of the speaker ID encoder sufficiently so that the output speaker ID code vector can converge to a relatively precise point, because the current state of the art cannot meet this requirement.
Second, ensure that each speaker not in the training set has a sufficiently large amount of audio corpus data as a reference. A speaker encoder can then extract a large number of identity coding vectors from this corpus, yielding a relatively stable average that approaches the optimal point used by the voice conversion model (a short sketch of this averaging idea follows this list).
Third, treat each sentence of corpus data as a separate, distinct speaker, so that when training the voice conversion model, different utterances of the same speaker use identity coding vectors extracted from the corresponding sentence audio as input to the conversion system. But this method is completely infeasible: experiments fully show that training the voice conversion model this way does not converge, and the training of the whole model ultimately collapses.
Fourth, preprocess the speaker identity code with a neural network, namely the method provided by the invention.
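As a sketch of the averaging idea in the second point above (the function and its inputs are illustrative, not part of the patent), the stable identity vector is simply the mean of many per-utterance embeddings:

```python
import torch

def mean_identity_vector(speaker_encoder, utterances):
    """Average many per-utterance identity vectors into one stable vector."""
    vecs = [speaker_encoder(u) for u in utterances]  # one 256-dim vector per utterance
    return torch.stack(vecs).mean(dim=0)             # relatively stable average
```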
7) The steps for using the generator in the conversion phase are as follows (a sketch of this pipeline follows the list):
first, a speaker identity encoder extracts the speaker identity coding vector from the corpus;
second, the extracted vector is input into the generator together with the content code extracted by the content encoder;
third, the generator produces the adjusted speaker identity coding vector;
fourth, the result obtained from the generator is input into the decoder as the speaker's final identity coding vector, i.e. as the speaker's identity label;
fifth, the decoder generates the converted audio feature sequence.
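A minimal sketch of this conversion-phase wiring, assuming trained speaker_encoder, content_encoder, generator, and decoder modules (the function name and signatures are hypothetical, chosen only to mirror the five steps above):

```python
def convert(src_feats, tgt_feats, speaker_encoder, content_encoder,
            generator, decoder):
    s_raw = speaker_encoder(tgt_feats)     # step 1: raw identity code vector
    content = content_encoder(src_feats)   # step 2: content code of source speech
    s_adj = generator(s_raw, content)      # steps 2-3: adjusted identity vector
    return decoder(content, s_adj)         # steps 4-5: converted feature sequence
```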
8) The final training objectives for this generator are as follows:
S_nA = E_s(X_nA)

where X_nA represents the audio features of the original speech, E_s represents the speaker identity encoder, and S_nA represents the resulting initial speaker identity coding vector.

S'_A = G(S_nA)

where G represents the generator according to the present invention and S'_A represents the adjusted speaker identity coding vector, i.e. the corresponding output of the generator.

L_adjust = ‖G(S_nA) − S̄_A‖

where L_adjust represents the loss function used in training the generator: the gap between the generator's output and the mean identity coding vector S̄_A of the corresponding speaker used in the voice conversion model is minimized, and the generator is thereby trained.
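A minimal sketch of this objective, assuming the squared L2 distance as the concrete reading of "minimizing the gap" (the patent text does not fix the norm) and hypothetical module names matching the earlier sketches:

```python
import torch.nn.functional as F

def generator_loss(generator, speaker_encoder, content_encoder, utt_feats, s_bar):
    """L_adjust for one batch of utterances of a training speaker.

    s_bar is the mean identity coding vector of that speaker as used by
    the voice conversion model (a 256-dim tensor)."""
    s_n = speaker_encoder(utt_feats)        # S_nA = E_s(X_nA)
    content = content_encoder(utt_feats)
    s_adj = generator(s_n, content)         # S'_A = G(S_nA)
    return F.mse_loss(s_adj, s_bar.expand_as(s_adj))
```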
After training, the generator has the ability to reasonably adjust identity coding vectors of speakers not in the training set. Experiments show that this method effectively improves the usability of the speaker identity coding vector; in particular, in the zero-sample voice conversion task it greatly improves how well the identity coding vector matches the voice conversion model, ultimately improving the naturalness and similarity of the converted speech.
The present application provides a zero-sample voice conversion corpus preprocessing system using a neural network, the system comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method steps of any of the embodiments of the first aspect in accordance with instructions in the program code.
The present application provides a computer readable storage medium for storing program code for performing the method steps of any of the embodiments of the first aspect.
The present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method steps of any of the embodiments of the first aspect.
In specific implementations, the present invention further provides a computer storage medium, where the computer storage medium may store a program which, when executed, may perform some or all of the steps of the embodiments of the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented using software plus any required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Identical and similar parts of the various embodiments in this specification may be cross-referenced. In particular, since the other embodiments are substantially similar to the method embodiments, their description is brief; for the relevant points, refer to the description in the method embodiments.
The above-described embodiments of the present invention do not limit the scope of the present invention.

Claims (5)

1. A method for preprocessing zero-sample speech conversion corpus using neural networks, the method comprising:
a generator built from a neural network preprocesses the identity coding vectors of speakers not in the training set; a 256-dimensional vector represents the speaker's personalized timbre characteristics and corresponds to the speaker's identity label;
speaker-related information and speaker-independent information in the speech are separated by an encoder, the extracted speaker-related information being 32-dimensional or 64-dimensional;
the generator consists of 7 neural network layers: the first three layers are one-dimensional convolution layers with kernel size 5, each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512; the next three layers are a recurrent LSTM network, and after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768; the last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code;
the generator is trained independently of the voice conversion model, so that for an input speaker identity coding vector it outputs a result that is close to that vector but not fully identical; the closer the generator's output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better;
the speaker identity code is preprocessed based on the neural network: a speaker identity encoder extracts the speaker identity coding vector from the corpus, the extracted vector is input into the generator together with the content code extracted by the content encoder, the generator produces the adjusted speaker identity coding vector, and the result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
2. The method of claim 1, wherein: the final training goals for this generator are as follows:
S_nA = E_s(X_nA)

where X_nA represents the audio features of the original speech, E_s represents the speaker identity encoder, and S_nA represents the resulting initial speaker identity coding vector;

S'_A = G(S_nA)

where G represents the generator of the invention and S'_A represents the adjusted speaker identity coding vector, i.e. the corresponding output of the generator;

L_adjust = ‖G(S_nA) − S̄_A‖

where L_adjust represents the loss function used in training the generator: the gap between the generator's output and the mean identity coding vector S̄_A of the speaker used in the voice conversion model is minimized, and the generator is thereby trained.
3. The method according to any one of claims 1-2, characterized in that: a generative adversarial network consists of a generator and a discriminator, which are continuously optimized and iterated against each other according to a given objective function, finally yielding the model.
4. A system for zero-sample speech conversion corpus preprocessing using neural networks, the system comprising a processor and a memory: the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method steps of any of claims 1-3 according to instructions in the program code.
5. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program code for performing the method steps of any of claims 1-3.
CN202011433778.3A 2020-12-10 2020-12-10 Zero-sample voice conversion corpus preprocessing method using neural network Expired - Fee Related CN112562686B (en)

Priority Applications (1)

Application Number: CN202011433778.3A (CN112562686B)
Priority Date: 2020-12-10
Filing Date: 2020-12-10
Title: Zero-sample voice conversion corpus preprocessing method using neural network

Applications Claiming Priority (1)

Application Number: CN202011433778.3A (CN112562686B)
Priority Date: 2020-12-10
Filing Date: 2020-12-10
Title: Zero-sample voice conversion corpus preprocessing method using neural network

Publications (2)

Publication Number Publication Date
CN112562686A CN112562686A (en) 2021-03-26
CN112562686B true CN112562686B (en) 2022-07-15

Family

ID=75060199

Family Applications (1)

Application Number: CN202011433778.3A (CN112562686B, Expired - Fee Related)
Priority Date: 2020-12-10
Filing Date: 2020-12-10
Title: Zero-sample voice conversion corpus preprocessing method using neural network

Country Status (1)

Country Link
CN (1) CN112562686B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018157703A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Natural language semantic extraction method and device, and computer storage medium
CN110537222A (en) * 2017-04-21 2019-12-03 高通股份有限公司 Anharmonic wave speech detection and bandwidth expansion in multi-source environment
WO2019047703A1 (en) * 2017-09-06 2019-03-14 腾讯科技(深圳)有限公司 Audio event detection method and device, and computer-readable storage medium
WO2019096149A1 (en) * 2017-11-15 2019-05-23 中国科学院自动化研究所 Auditory selection method and device based on memory and attention model
WO2019196196A1 (en) * 2018-04-12 2019-10-17 科大讯飞股份有限公司 Whispering voice recovery method, apparatus and device, and readable storage medium
CN111144124A (en) * 2018-11-02 2020-05-12 华为技术有限公司 Training method of machine learning model, intention recognition method, related device and equipment
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"零样本图像识别";兰红;《电子与信息学报》;20200330;全文 *
潘崇煜." 融合零样本学习和小样本学习的弱监督学习方法综述".《系统工程与电子技术》.2020, *

Also Published As

Publication number Publication date
CN112562686A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
Meng et al. Internal language model estimation for domain-adaptive end-to-end speech recognition
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
Li et al. Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion
Tjandra et al. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019
Kameoka et al. Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks
Liu et al. Any-to-many voice conversion with location-relative sequence-to-sequence modeling
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN111312245B (en) Voice response method, device and storage medium
Park et al. Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data
Kameoka et al. Many-to-many voice transformer network
Nguyen et al. Nvc-net: End-to-end adversarial voice conversion
CN112712813B (en) Voice processing method, device, equipment and storage medium
Tüske et al. Advancing Sequence-to-Sequence Based Speech Recognition.
CN111862934A (en) Method for improving speech synthesis model and speech synthesis method and device
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN113360610A (en) Dialog generation method and system based on Transformer model
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Saito et al. DNN-based speaker embedding using subjective inter-speaker similarity for multi-speaker modeling in speech synthesis
Chandak et al. Streaming language identification using combination of acoustic representations and ASR hypotheses
CN117765959A (en) Voice conversion model training method and voice conversion system based on pitch
Park et al. The Second DIHARD Challenge: System Description for USC-SAIL Team.
Zhao et al. Research on voice cloning with a few samples
CN116564330A (en) Weak supervision voice pre-training method, electronic equipment and storage medium
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium
CN112562686B (en) Zero-sample voice conversion corpus preprocessing method using neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2022-07-15