WO2021237923A1 - 智能配音方法、装置、计算机设备和存储介质 (Intelligent dubbing method and apparatus, computer device, and storage medium) - Google Patents

智能配音方法、装置、计算机设备和存储介质 (Intelligent dubbing method and apparatus, computer device, and storage medium)

Info

Publication number
WO2021237923A1
WO2021237923A1 (PCT application No. PCT/CN2020/105266)
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
source
target
speaker
generator
Prior art date
Application number
PCT/CN2020/105266
Other languages
English (en)
French (fr)
Inventor
马坤
王家桢
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021237923A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • This application relates to the technical field of artificial intelligence speech processing, in particular to an intelligent dubbing method, device, computer equipment, and computer-readable storage medium.
  • Dubbing is an important work in the field of film and television entertainment.
  • the inventor realized that, at present, in order to complete certain dubbing tasks, it is often necessary to find someone with the corresponding speaking style and timbre to do the dubbing in person. This method is time-consuming, laborious, and inefficient.
  • the purpose of this application is to provide an intelligent dubbing method, device, computer equipment, and computer-readable storage medium.
  • An intelligent dubbing method is provided, including: acquiring voice data of a speaker to be dubbed, the speaker to be dubbed being one of the source speaker and the target speaker; performing standardized processing on the voice data, and extracting the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after the standardized processing; extracting first Mel cepstrum frequency coefficients of a first predetermined number of dimensions from the spectrum envelope; inputting the first Mel cepstrum frequency coefficients to the forward generator or the reverse generator of a pre-trained cyclic generative adversarial network model to obtain second Mel cepstrum frequency coefficients of the first predetermined number of dimensions output by the forward generator or the reverse generator, wherein the first Mel cepstrum frequency coefficients are input to the forward generator when the speaker to be dubbed is the source speaker, and to the reverse generator when the speaker to be dubbed is the target speaker; the cyclic generative adversarial network model includes a forward generator, a reverse generator, a forward discriminator, and a reverse discriminator, and the forward generator and reverse generator of the pre-trained model are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and reverse discriminator of the model; and generating, based on the source voice data of the source speaker, the target voice data of the target speaker, the second Mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • an intelligent dubbing device including:
  • An acquiring module configured to acquire voice data of a speaker to be dubbed, the speaker to be dubbed being one of the source speaker and the target speaker;
  • the processing and extraction module is configured to perform standardized processing on the voice data, and extract the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after the standardized processing;
  • An extraction module, which extracts the first Mel cepstrum frequency coefficients of the first predetermined number of dimensions from the spectrum envelope;
  • the input module is configured to input the first Mel cepstrum frequency coefficients to the forward generator or the reverse generator of the pre-trained cyclic generative adversarial network model to obtain the second Mel cepstrum frequency coefficients of the first predetermined number of dimensions output by the forward generator or the reverse generator, wherein the first Mel cepstrum frequency coefficients are input to the forward generator when the speaker to be dubbed is the source speaker, and to the reverse generator when the speaker to be dubbed is the target speaker; the cyclic generative adversarial network model includes a forward generator, a reverse generator, a forward discriminator, and a reverse discriminator, and the forward generator and reverse generator of the pre-trained model are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and reverse discriminator of the model;
  • a generating module configured to generate, based on the source voice data of the source speaker, the target voice data of the target speaker, the second Mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • A computer device is provided, including a memory and a processor, the memory being used to store a smart dubbing program for the processor, and the processor being configured to execute the following processing by running the smart dubbing program: obtaining the voice data of the speaker to be dubbed, the speaker to be dubbed being one of the source speaker and the target speaker; performing standardized processing on the voice data, and extracting the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after the standardized processing; extracting the first Mel cepstrum frequency coefficients of the first predetermined number of dimensions from the spectrum envelope; inputting the first Mel cepstrum frequency coefficients to the forward generator or the reverse generator of the pre-trained cyclic generative adversarial network model to obtain the second Mel cepstrum frequency coefficients of the first predetermined number of dimensions output by the forward generator or the reverse generator, wherein the first Mel cepstrum frequency coefficients are input to the forward generator when the speaker to be dubbed is the source speaker, and to the reverse generator when the speaker to be dubbed is the target speaker; the cyclic generative adversarial network model includes a forward generator, a reverse generator, a forward discriminator, and a reverse discriminator, and the forward generator and reverse generator are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and reverse discriminator of the model; and generating, based on the source voice data of the source speaker, the target voice data of the target speaker, the second Mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • A computer-readable storage medium storing computer-readable instructions is provided, on which a smart dubbing program is stored; when the smart dubbing program is executed by a processor, the following processing is implemented: obtaining the voice data of the speaker to be dubbed, the speaker to be dubbed being one of the source speaker and the target speaker; performing standardized processing on the voice data, and extracting the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after the standardized processing; extracting the first Mel cepstrum frequency coefficients of the first predetermined number of dimensions from the spectrum envelope; inputting the first Mel cepstrum frequency coefficients to the forward generator or the reverse generator of the pre-trained cyclic generative adversarial network model to obtain the second Mel cepstrum frequency coefficients of the first predetermined number of dimensions output by the forward generator or the reverse generator, wherein the first Mel cepstrum frequency coefficients are input to the forward generator when the speaker to be dubbed is the source speaker, and to the reverse generator when the speaker to be dubbed is the target speaker; the cyclic generative adversarial network model includes a forward generator, a reverse generator, a forward discriminator, and a reverse discriminator, whose forward generator and reverse generator are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and reverse discriminator of the model; and generating, based on the source voice data of the source speaker, the target voice data of the target speaker, the second Mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • With the above intelligent dubbing method, device, computer equipment, and computer-readable storage medium, the voice data of the speaker to be dubbed is first subjected to standardized processing; the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the standardized voice data are then extracted; the first Mel cepstrum frequency coefficients are extracted from the spectrum envelope; the first Mel cepstrum frequency coefficients are input to the forward generator or the reverse generator of the pre-trained cyclic generative adversarial network model to obtain the second Mel cepstrum frequency coefficients; and finally speech is generated based on the second Mel cepstrum frequency coefficients, the source voice data, the target voice data, the fundamental frequency, and the aperiodic signal parameters. Voice conversion between speakers with different timbres can thus be achieved automatically, using only the voice data of the two parties whose timbres are to be converted; one person's voice can be converted into another's without having to find a person with the corresponding timbre to do the dubbing, which improves dubbing efficiency and reduces dubbing cost.
  • Fig. 1 is a schematic diagram showing a system architecture of an application of an intelligent dubbing method according to an exemplary embodiment
  • Fig. 2 is a flow chart showing an intelligent dubbing method according to an exemplary embodiment
  • Fig. 3 is a schematic diagram showing the architecture of the cyclic generative adversarial network model used when the intelligent dubbing method provided by the present application is applied, according to an exemplary embodiment
  • Fig. 4A is a schematic diagram showing the principle of the cycle consistency loss and adversarial loss of the cyclic generative adversarial network model according to an exemplary embodiment
  • Fig. 4B is a schematic diagram showing the principle of the identity mapping loss of the cyclic generative adversarial network model according to an exemplary embodiment
  • Fig. 5 is a block diagram showing an intelligent dubbing device according to an exemplary embodiment
  • Fig. 6 is an exemplary block diagram showing a computer device that implements the above intelligent dubbing method according to an exemplary embodiment
  • Fig. 7 shows a computer-readable storage medium for realizing the above intelligent dubbing method according to an exemplary embodiment.
  • This application first provides an intelligent dubbing method.
  • Intelligent dubbing refers to the conversion of the voice of the first person into the voice of the second person. The content of the voices of the two people before and after the conversion remains unchanged, but the timbres of the voices of the two before and after the conversion belong to the first person and the second person respectively.
  • The intelligent dubbing method provided in this application applies artificial intelligence technology to realize intelligent dubbing, and can be used in many fields. For example, in intelligent marketing and intelligent operations in the financial field, the voices of customer service agents can be converted into a softer timbre, improving the experience of the person receiving the call and thereby boosting product sales. In the field of education, it can be applied to audiobooks and online education, converting the voice of a teacher giving an online lecture into the voice of a teacher the students like, which stimulates interest in learning. In the film and television field, it enables intelligent dubbing: for example, when shooting a documentary about a deceased great figure, that person's voice can be extracted from historical footage, and after a voice actor dubs the documentary, the timbre can be converted into the voice of the deceased figure, making the documentary feel more authentic and true to its era.
  • the implementation terminal of this application can be any device with arithmetic processing and communication functions.
  • the device can be connected to an external device to receive or send data.
  • Specifically, it can be a portable mobile device, such as a smart phone, a tablet computer, a notebook computer, or a PDA (Personal Digital Assistant); it can also be a fixed device, such as computer equipment, a field terminal, a desktop computer, a server, or a workstation; or it can be a collection of multiple devices, such as the physical infrastructure of cloud computing or a server cluster.
  • the implementation terminal of this application may be a server or a cloud computing physical infrastructure.
  • Fig. 1 is a schematic diagram showing a system architecture of an application of an intelligent dubbing method according to an exemplary embodiment.
  • the system architecture includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a communication link to receive or send data.
  • the user can input voice data through the terminal 110.
  • the server 120 is the implementation terminal of this application.
  • The server 120 is provided with a pre-trained cyclic generative adversarial network model. After the voice data entered by the user through the terminal 110 is uploaded to the server 120, the cyclic generative adversarial network model on the server 120 can be used for dubbing.
  • Fig. 1 is only an embodiment of the present application.
  • Although in Fig. 1 the user, i.e. the speaker to be dubbed, enters the voice data at the terminal and uploads it via the terminal to the implementation terminal of this application, in other embodiments or specific applications the terminal where the user enters the voice data and the implementation terminal of this application may be the same terminal. Likewise, although the embodiment of Fig. 1 includes no terminal connected to the server 120 other than the terminal 110, other embodiments may include additional terminals connected to the server 120, for example a terminal that provides data for training the cyclic generative adversarial network model on the server 120.
  • Fig. 2 is a flow chart showing a method for smart dubbing according to an exemplary embodiment.
  • the smart dubbing method provided in this embodiment can be executed by a server, as shown in FIG. 2, and includes the following steps:
  • Step 210 Obtain the voice data of the speaker to be dubbed.
  • the speaker to be dubbed is one of the source speaker and the target speaker.
  • the voice data of the speaker to be dubbed may exist in various voice formats, such as CD format, WAV format, MP3 format, and so on.
  • The terms source speaker and target speaker are defined relative to the training of the model. The speaker to be dubbed is one of the source speaker and the target speaker, which means that whether the speaker to be dubbed is the source speaker or the target speaker, the intelligent dubbing method provided in this application can convert the voice of the speaker to be dubbed into the voice of the opposite party.
  • Step 220 Perform standardization processing on the voice data, and extract the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after the standardization processing.
  • the standardized processing of the voice data includes:
  • the voice data is converted into a sampling rate of a predetermined frequency and a predetermined format.
  • For example, the voice data can be uniformly converted to a 16,000 Hz (16 kHz) sampling rate and single-channel WAV format.
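  • As a minimal sketch of this standardization step, assuming the Python packages librosa and soundfile are available (neither is named in the application), the conversion to a 16 kHz single-channel WAV file could look like this:

```python
import librosa
import soundfile as sf

# Load any supported input format (WAV, MP3, ...), resampling to 16,000 Hz mono.
samples, rate = librosa.load("speaker_input.mp3", sr=16000, mono=True)

# Write the standardized single-channel 16 kHz WAV file.
sf.write("speaker_16k_mono.wav", samples, rate, subtype="PCM_16")
```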
  • The spectrum envelope is the curve formed by connecting the amplitude peaks at different frequencies.
  • the fundamental frequency refers to the frequency of the fundamental tone in a polyphony.
  • the non-periodic signal parameter is a parameter used to reflect the timbre of the voice.
  • the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after standardized processing can be extracted in a variety of ways.
  • the WORLD toolkit can be used to extract the spectrum envelope, fundamental frequency and aperiodic signal parameters of the voice.
  • Step 230 Extract the first Mel cepstrum frequency coefficients of the first predetermined number of dimensions of the spectrum envelope.
  • Various methods can be used to extract the Mel frequency cepstral coefficient of the voice data, for example, it can be extracted by using the CodeSpectralEnvelope method of the WORLD toolkit.
  • the number of dimensions of the first Mel cepstrum frequency coefficient can be set in advance based on human experience or regulations, for example, it can be 39, 48, etc.
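  • As an illustrative sketch of steps 220-230, assuming the pyworld package (a Python binding of the WORLD toolkit; the application itself only names WORLD) is used, the three acoustic features and the first Mel cepstrum frequency coefficients could be extracted as follows:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

# Read the standardized 16 kHz single-channel WAV produced in step 220.
samples, fs = sf.read("speaker_16k_mono.wav")
samples = np.ascontiguousarray(samples, dtype=np.float64)  # pyworld expects float64

f0, time_axis = pw.harvest(samples, fs)         # fundamental frequency per frame
sp = pw.cheaptrick(samples, f0, time_axis, fs)  # spectrum envelope
ap = pw.d4c(samples, f0, time_axis, fs)         # aperiodic signal parameters

# Compress the spectrum envelope into a first predetermined number of dimensions,
# e.g. 39, using WORLD's CodeSpectralEnvelope (exposed by pyworld).
first_mcc = pw.code_spectral_envelope(sp, fs, 39)
```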
  • Step 240 Input the first Mel cepstrum frequency coefficients to the forward generator or the reverse generator of the pre-trained cyclic generative adversarial network model to obtain the second Mel cepstrum frequency coefficients of the first predetermined number of dimensions output by the forward generator or the reverse generator.
  • Here, when the speaker to be dubbed is the source speaker, the first Mel cepstrum frequency coefficients are input to the forward generator; when the speaker to be dubbed is the target speaker, the first Mel cepstrum frequency coefficients are input to the reverse generator. The cyclic generative adversarial network model includes a forward generator, a reverse generator, a forward discriminator, and a reverse discriminator, and the forward generator and the reverse generator of the pre-trained model are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and the reverse discriminator of the model.
  • In one embodiment, the forward generator and the reverse generator of the pre-trained cyclic generative adversarial network model are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and the reverse discriminator of the model, as follows:
  • the forward generator of the cyclic generative adversarial network model is obtained by training on the source voice data of the source speaker and the target voice data of the target speaker, based on the forward cycle consistency loss, the reverse cycle consistency loss, the forward adversarial loss, and the forward identity mapping loss;
  • the reverse generator of the cyclic generative adversarial network model is obtained by training on the source voice data of the source speaker and the target voice data of the target speaker, based on the reverse cycle consistency loss, the forward cycle consistency loss, the reverse adversarial loss, and the reverse identity mapping loss;
  • wherein the forward adversarial loss is obtained by the forward discriminator and measures, after the forward generator converts the source voice data into pseudo target voice data, the difference between the target voice data and the pseudo target voice data; the reverse adversarial loss is obtained by the reverse discriminator and measures, after the reverse generator converts the target voice data into pseudo source voice data, the difference between the source voice data and the pseudo source voice data; the forward cycle consistency loss measures, after the forward generator converts the source voice data into pseudo target voice data and the reverse generator converts the pseudo target voice data into cyclic source voice data, the difference between the cyclic source voice data and the source voice data; the reverse cycle consistency loss measures, after the reverse generator converts the target voice data into pseudo source voice data and the forward generator converts the pseudo source voice data into cyclic target voice data, the difference between the cyclic target voice data and the target voice data; the forward identity mapping loss measures, after the forward generator converts the target voice data into target identity voice data, the difference between the target voice data and the target identity voice data; and the reverse identity mapping loss measures, after the reverse generator converts the source voice data into source identity voice data, the difference between the source voice data and the source identity voice data.
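  • Stated compactly, these terms match the standard cycle-consistent GAN objective. The following restatement is a hedged sketch consistent with the definitions above; the trade-off weights λ_cyc and λ_id are assumptions, since the application does not give numerical weights:

```latex
\begin{aligned}
\mathcal{L}_{adv}^{X \to Y} &= \mathbb{E}_{y}\!\left[\log D_Y(y)\right]
  + \mathbb{E}_{x}\!\left[\log\left(1 - D_Y\!\big(G_{X \to Y}(x)\big)\right)\right] \\
\mathcal{L}_{cyc} &= \mathbb{E}_{x}\!\left[\left\lVert G_{Y \to X}\!\big(G_{X \to Y}(x)\big) - x \right\rVert_1\right]
  + \mathbb{E}_{y}\!\left[\left\lVert G_{X \to Y}\!\big(G_{Y \to X}(y)\big) - y \right\rVert_1\right] \\
\mathcal{L}_{id} &= \mathbb{E}_{y}\!\left[\left\lVert G_{X \to Y}(y) - y \right\rVert_1\right]
  + \mathbb{E}_{x}\!\left[\left\lVert G_{Y \to X}(x) - x \right\rVert_1\right] \\
\mathcal{L}_{total} &= \mathcal{L}_{adv}^{X \to Y} + \mathcal{L}_{adv}^{Y \to X}
  + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}
\end{aligned}
```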
  • the forward generator and the reverse generator have the same structure, and the forward discriminator and the reverse discriminator have the same structure.
  • In this embodiment, by designing the forward generator and the reverse generator with the same structure, and likewise designing the forward discriminator and the reverse discriminator with the same structure, the cyclic generative adversarial network model is optimized, thereby improving the dubbing effect.
  • The pre-trained cyclic generative adversarial network model can be deployed on a blockchain, which improves security and facilitates application of the model.
  • In one embodiment, the forward generator and the reverse generator each include: a two-dimensional first convolution unit, a one-dimensional second convolution unit connected to the first convolution unit, and a two-dimensional third convolution unit connected to the second convolution unit, wherein the output part of each convolution unit includes a gated linear unit.
  • By placing two-dimensional convolution units at the front and back of the generator, the generator can capture features more broadly, and by placing a one-dimensional convolution unit in the middle of the generator, sequential voice data can be processed better.
  • Gated Linear Units can be used to avoid gradient loss during training.
  • the output part of each convolution unit includes a gated linear unit, that is, the activation function of each convolution unit is a gated linear unit.
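  • A minimal PyTorch sketch of such a 2D-1D-2D generator with gated linear unit outputs is given below. PyTorch itself, the channel counts, kernel sizes, and the reshape strategy are all illustrative assumptions rather than values taken from the application, and the down/upsampling and residual blocks typically used in practice are omitted:

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """Two-dimensional convolution unit whose output part is a gated linear unit."""
    def __init__(self, in_ch, out_ch, kernel, stride, padding):
        super().__init__()
        # Produce twice the channels so torch.nn.functional.glu can split them
        # into a linear half and a gate half.
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, stride, padding)

    def forward(self, x):
        return nn.functional.glu(self.conv(x), dim=1)

class GLUConv1d(nn.Module):
    """One-dimensional convolution unit with a gated linear unit output."""
    def __init__(self, in_ch, out_ch, kernel, stride, padding):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel, stride, padding)

    def forward(self, x):
        return nn.functional.glu(self.conv(x), dim=1)

class Generator(nn.Module):
    """2D -> 1D -> 2D generator; input and output are (batch, 1, n_mcc, n_frames)."""
    def __init__(self, n_mcc=39):
        super().__init__()
        self.front = GLUConv2d(1, 64, kernel=(3, 3), stride=1, padding=1)        # 2-D feature capture
        self.middle = GLUConv1d(64 * n_mcc, 256, kernel=5, stride=1, padding=2)  # 1-D sequence modelling
        self.back_proj = nn.Conv1d(256, 64 * n_mcc, kernel_size=1)
        self.back = nn.Conv2d(64, 1, kernel_size=(3, 3), stride=1, padding=1)    # 2-D reconstruction

    def forward(self, x):
        b, _, d, t = x.shape
        h = self.front(x)                          # (b, 64, d, t)
        h = h.reshape(b, 64 * d, t)                # collapse the coefficient axis for 1-D convolution
        h = self.middle(h)                         # (b, 256, t)
        h = self.back_proj(h).reshape(b, 64, d, t)
        return self.back(h)                        # converted Mel cepstrum frequency coefficients
```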
  • the output layers of the forward discriminator and the reverse discriminator are both two-dimensional convolutional layers.
  • When the output layers of the forward discriminator and reverse discriminator are two-dimensional convolutional layers, the authenticity judgment output by the discriminator is no longer a single True/False decision for the entire speech signal; instead, the discriminator outputs an n×n matrix in which each element is a judgment result (True or False) corresponding to a subset of the speech signal.
  • Such a discriminator judges the details of the converted speech more accurately, and makes the output speech of the conversion result sound clearer and more lifelike.
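  • A hedged PyTorch sketch of a discriminator with a two-dimensional convolutional output layer is shown below; the layer widths and strides are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Discriminator whose output layer is a 2-D convolution, so it emits a grid of
    real/fake decisions, each covering only a patch of the input feature map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )
        # Two-dimensional convolutional output layer: one score per patch, not per utterance.
        self.out = nn.Conv2d(128, 1, kernel_size=3, stride=1, padding=1)

    def forward(self, x):                              # x: (batch, 1, n_mcc, n_frames)
        return torch.sigmoid(self.out(self.body(x)))   # (batch, 1, h, w) patch decisions
```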
  • Referring to Fig. 3, it exemplarily shows the architecture of the cyclic generative adversarial network model used when the intelligent dubbing method provided by the present application is applied. The model includes a forward generator 320, a reverse generator 320', a forward discriminator 340, and a reverse discriminator 340'.
  • In the first part of the model, the source voice data 310 is input to the forward generator 320 and converted into pseudo target voice data 330; the pseudo target voice data 330 and the target voice data 350 are input to the forward discriminator 340 for judgment, in order to calculate the forward adversarial loss.
  • In the second part of the model, the target voice data 350 is input to the reverse generator 320' and converted into pseudo source voice data 330'; the pseudo source voice data 330' and the source voice data 310 are input to the reverse discriminator 340' for judgment, in order to calculate the reverse adversarial loss.
  • In addition, the pseudo target voice data 330 is fed into the reverse generator 320', and the pseudo source voice data 330' is fed into the forward generator 320, to calculate the cycle consistency losses; the target voice data 350 is also input to the forward generator 320, and the source voice data 310 is also input to the reverse generator 320', to calculate the identity mapping losses.
  • In one embodiment, the cyclic generative adversarial network model is obtained by training in the following manner:
  • the source voice data of the source speaker and the target voice data of the target speaker are acquired, the durations of the source voice data and the target voice data each exceeding a predetermined duration; the source voice data and the target voice data are standardized, and the spectrum envelopes of the standardized source voice data and target voice data are extracted; and the following training steps are performed iteratively until the training of the cyclic generative adversarial network model reaches a predetermined condition: using the spectrum envelopes of the source voice data and the target voice data, the Mel frequency cepstral coefficients corresponding to a second predetermined number of consecutive frames of the source voice data and of the target voice data are extracted, the Mel frequency cepstral coefficients having a first predetermined number of dimensions;
  • the Mel frequency cepstral coefficients of the source voice data and of the target voice data are respectively input to the cyclic generative adversarial network model, and after the outputs of the generators and discriminators of the model are computed, a loss function is calculated based on those outputs and the parameters of the model are updated based on the output of the loss function.
  • The predetermined duration is a duration, set based on experience, that allows the cyclic generative adversarial network model to be trained well; for example, it can be 10 minutes or 15 minutes.
  • The predetermined condition is the condition for stopping the training of the cyclic generative adversarial network model.
  • The predetermined condition may be, for example, that the number of iterations of the training steps reaches a predetermined count threshold, that the duration of iteratively performing the training steps reaches a predetermined time threshold, or that the output of the loss function falls below a predetermined result threshold.
  • Various methods can be used to extract the spectral envelope of the voice data, for example, by using the WORLD toolkit.
  • The parameters of the cyclic generative adversarial network model may be updated using a backpropagation algorithm.
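  • The following is a hedged sketch of one such parameter update, assuming PyTorch, a binary cross-entropy adversarial objective, and illustrative loss weights (lambda_cyc, lambda_id); none of these specifics are stated in the application. The generators, discriminators, optimizers, and feature batches are passed in as arguments:

```python
import torch
import torch.nn.functional as F

def train_step(g_xy, g_yx, d_y, d_x, x_real, y_real, opt_g, opt_d,
               lambda_cyc=10.0, lambda_id=5.0):
    """One update of the cyclic generative adversarial network: g_xy/g_yx are the
    forward/reverse generators, d_y/d_x the forward/reverse discriminators, and
    x_real/y_real are batches of source/target Mel frequency cepstral coefficients."""
    # Generator update: adversarial + cycle consistency + identity mapping losses.
    y_fake, x_fake = g_xy(x_real), g_yx(y_real)
    x_cycle, y_cycle = g_yx(y_fake), g_xy(x_fake)
    pred_y, pred_x = d_y(y_fake), d_x(x_fake)
    loss_adv = (F.binary_cross_entropy(pred_y, torch.ones_like(pred_y)) +
                F.binary_cross_entropy(pred_x, torch.ones_like(pred_x)))
    loss_cyc = F.l1_loss(x_cycle, x_real) + F.l1_loss(y_cycle, y_real)
    loss_id = F.l1_loss(g_xy(y_real), y_real) + F.l1_loss(g_yx(x_real), x_real)
    loss_g = loss_adv + lambda_cyc * loss_cyc + lambda_id * loss_id
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update: real features -> 1, generated features -> 0 (backpropagation).
    real_y, real_x = d_y(y_real), d_x(x_real)
    fake_y, fake_x = d_y(y_fake.detach()), d_x(x_fake.detach())
    loss_d = 0.5 * (F.binary_cross_entropy(real_y, torch.ones_like(real_y)) +
                    F.binary_cross_entropy(real_x, torch.ones_like(real_x)) +
                    F.binary_cross_entropy(fake_y, torch.zeros_like(fake_y)) +
                    F.binary_cross_entropy(fake_x, torch.zeros_like(fake_x)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```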
  • In the field of sound processing, the mel-frequency cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the mel-frequency cepstrum. They are derived from the cepstrum of an audio segment. The difference between the cepstrum and the mel-frequency cepstrum is that the frequency bands of the mel-frequency cepstrum are equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced bands used in the ordinary cepstrum. Such a non-linear representation gives a better representation of the sound signal in many fields, for example in audio compression.
  • The number of consecutive frames of voice data (the second predetermined number) and the number of dimensions of the Mel frequency cepstral coefficients (the first predetermined number) are both set in advance based on factors such as experience; for example, the number of consecutive frames can be 128, and the number of dimensions of the Mel frequency cepstral coefficients can be 32, 39, 48, and so on.
  • Various methods can be used to extract the Mel frequency cepstral coefficient of the voice data, for example, it can be extracted by using the CodeSpectralEnvelope method of the WORLD Voice Toolkit.
  • Referring to Fig. 4A, it shows a schematic diagram of the principle of the cycle consistency loss and the adversarial loss of the cyclic generative adversarial network model; the left side is the forward learning process and the right side is the reverse learning process.
  • Referring to the left side of Fig. 4A, suppose x_real is the source voice data and G_X→Y is the forward generator. The source voice data is converted into pseudo target voice data y_fake through G_X→Y, and the forward discriminator D_Y is used to calculate the difference between the target voice data and the pseudo target voice data, giving the forward adversarial loss. Next, the pseudo target voice data y_fake is converted into cyclic source voice data x_cycle through the reverse generator G_Y→X, and the difference between the cyclic source voice data x_cycle and the source voice data x_real is calculated, giving the forward cycle consistency loss.
  • In the same way, through the process shown on the right side of Fig. 4A, the reverse adversarial loss and the reverse cycle consistency loss are obtained.
  • Referring to Fig. 4B, it shows a schematic diagram of the principle of the identity mapping loss of the cyclic generative adversarial network model; the left side is the forward mapping process and the right side is the reverse mapping process.
  • Referring to the forward mapping process on the left, the target voice data is first input to the forward generator; after the forward generator converts the target voice data into target identity voice data, the difference between the target voice data and the target identity voice data is calculated, giving the forward identity mapping loss. In the same way, by inputting the source voice data to the reverse generator, which converts it into source identity voice data, the reverse identity mapping loss can be calculated.
  • The identity voice data is obtained by inputting voice data into a generator and taking the generator's output. By computing a loss function based on the difference between the voice data and the identity voice data, this difference can be minimized, which guarantees to the greatest extent that the generator preserves the linguistic content structure of the original voice when converting it.
  • Step 250 Generate, based on the source voice data of the source speaker, the target voice data of the target speaker, the second Mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • In one embodiment, step 250 may include: determining, from the source voice data of the source speaker and the target voice data of the target speaker, the average value and standard deviation of the fundamental frequency of the source voice data and the average value and standard deviation of the fundamental frequency of the target voice data; restoring the spectrum envelope of the speech to be generated from the second Mel cepstrum frequency coefficients; determining the aperiodic signal parameters of the speech to be generated from the aperiodic signal parameters of the voice data; determining the fundamental frequency of the speech to be generated based on the fundamental frequency of the voice data, the average value and standard deviation of the fundamental frequency of the source voice data, and the average value and standard deviation of the fundamental frequency of the target voice data; and synthesizing the speech to be generated from its spectrum envelope, aperiodic signal parameters, and fundamental frequency, as the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • the spectrum envelope, aperiodic signal parameters, and fundamental frequency of the speech to be generated can be used in multiple ways to synthesize the speech to be generated.
  • the speech synthesis method of the WORLD toolkit can be used to synthesize a speech file.
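  • A hedged sketch of this synthesis step, again assuming the pyworld binding of the WORLD toolkit, is shown below. The inputs (the converted Mel cepstrum coefficients, fundamental frequency, and aperiodic parameters) are assumed to come from the sub-steps described in the following paragraphs:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def synthesize_dubbed_speech(mcc_converted, f0_converted, ap_converted, fs,
                             out_path="dubbed.wav"):
    """Restore the spectrum envelope from the converted Mel cepstrum coefficients and
    synthesize the output waveform with the WORLD vocoder (pyworld expects float64)."""
    fft_size = pw.get_cheaptrick_fft_size(fs)
    sp_converted = pw.decode_spectral_envelope(
        np.ascontiguousarray(mcc_converted, dtype=np.float64), fs, fft_size)
    wav = pw.synthesize(np.ascontiguousarray(f0_converted, dtype=np.float64),
                        sp_converted,
                        np.ascontiguousarray(ap_converted, dtype=np.float64), fs)
    sf.write(out_path, wav, fs)
    return wav
```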
  • the determining the aperiodic signal parameters of the voice to be generated according to the aperiodic signal parameters of the voice data includes:
  • the aperiodic signal parameter of the voice data is used as the aperiodic signal parameter of the voice to be generated.
  • For example, if the aperiodic signal parameter of the voice data is AP_x and the aperiodic signal parameter of the speech to be generated is AP_y-converted, then by assigning AP_y-converted = AP_x, the aperiodic signal parameter of the voice data is used as the aperiodic signal parameter of the speech to be generated.
  • In one embodiment, determining the fundamental frequency of the speech to be generated based on the fundamental frequency of the voice data, the average value and standard deviation of the fundamental frequency of the source voice data, and the average value and standard deviation of the fundamental frequency of the target voice data includes:
  • determining the fundamental frequency of the speech to be generated using the following formulas:
  • F0_normalized = (F0_x - F0_mean_x) / F0_std_x
  • F0_y-converted = F0_normalized × F0_std_y + F0_mean_y
  • where F0_x is the fundamental frequency of the voice data; F0_mean_x is the average fundamental frequency of the source voice data or target voice data corresponding to the speaker to be dubbed; F0_std_x is the standard deviation of the fundamental frequency of the source voice data or target voice data corresponding to the speaker to be dubbed; F0_mean_y is the average fundamental frequency of the target voice data or source voice data corresponding to the speaker to be dubbed; F0_std_y is the standard deviation of the fundamental frequency of the target voice data or source voice data corresponding to the speaker to be dubbed; F0_normalized is an intermediate result; and F0_y-converted is the fundamental frequency of the speech to be generated.
  • Since the speaker to be dubbed may be either the source speaker or the target speaker, the above formulas apply to both scenarios. On the one hand, when the speaker to be dubbed is the source speaker, F0_mean_x is the average fundamental frequency of the source voice data, F0_std_x is the standard deviation of the fundamental frequency of the source voice data, F0_mean_y is the average fundamental frequency of the target voice data, and F0_std_y is the standard deviation of the fundamental frequency of the target voice data. On the other hand, when the speaker to be dubbed is the target speaker, F0_mean_x is the average fundamental frequency of the target voice data, F0_std_x is the standard deviation of the fundamental frequency of the target voice data, F0_mean_y is the average fundamental frequency of the source voice data, and F0_std_y is the standard deviation of the fundamental frequency of the source voice data.
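  • As a worked sketch of this conversion, assuming the statistics have already been computed from the source and target training data, the formulas above translate directly into the following function. The handling of unvoiced frames (F0 = 0) is a practical assumption not specified in the application, and in practice these statistics are often taken over log F0 rather than raw F0:

```python
import numpy as np

def convert_f0(f0_x, f0_mean_x, f0_std_x, f0_mean_y, f0_std_y):
    """Map the fundamental frequency of the speaker to be dubbed onto the statistics
    of the opposite speaker, following the normalize-then-denormalize formulas above."""
    f0_normalized = (f0_x - f0_mean_x) / f0_std_x          # intermediate result F0_normalized
    f0_converted = f0_normalized * f0_std_y + f0_mean_y    # F0_y-converted
    # Keep unvoiced frames (F0 == 0) unvoiced instead of shifting them to f0_mean_y.
    return np.where(f0_x > 0, f0_converted, 0.0)
```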
  • the present application also provides an intelligent dubbing device, and the following are device embodiments of the present application.
  • Fig. 5 is a block diagram showing an intelligent dubbing device according to an exemplary embodiment. As shown in FIG. 5, the device 500 includes:
  • the acquiring module 510 is configured to acquire voice data of a speaker to be dubbed, the speaker to be dubbed being one of the source speaker and the target speaker;
  • the processing and extraction module 520 is configured to perform standardized processing on the voice data, and extract the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the voice data after the standardized processing;
  • the extraction module 530 extracts the first Mel cepstrum frequency coefficients of the first predetermined number of dimensions of the spectrum envelope
  • the input module 540 is configured to input the first Mel cepstrum frequency coefficients to the forward generator or the reverse generator of the pre-trained cyclic generative adversarial network model to obtain the second Mel cepstrum frequency coefficients of the first predetermined number of dimensions output by the forward generator or the reverse generator, wherein the first Mel cepstrum frequency coefficients are input to the forward generator when the speaker to be dubbed is the source speaker, and to the reverse generator when the speaker to be dubbed is the target speaker; the cyclic generative adversarial network model includes a forward generator, a reverse generator, a forward discriminator, and a reverse discriminator, and the forward generator and reverse generator of the pre-trained model are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and reverse discriminator of the model;
  • the generating module 550 is configured to generate, based on the source voice data of the source speaker, the target voice data of the target speaker, the second Mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed.
  • In one embodiment, the forward generator and the reverse generator of the pre-trained cyclic generative adversarial network model are trained using the source voice data of the source speaker and the target voice data of the target speaker, based on the forward discriminator and the reverse discriminator of the model, as follows:
  • the forward generator of the cyclic generative adversarial network model is obtained by training on the source voice data of the source speaker and the target voice data of the target speaker, based on the forward cycle consistency loss, the reverse cycle consistency loss, the forward adversarial loss, and the forward identity mapping loss;
  • the reverse generator of the cyclic generative adversarial network model is obtained by training on the source voice data of the source speaker and the target voice data of the target speaker, based on the reverse cycle consistency loss, the forward cycle consistency loss, the reverse adversarial loss, and the reverse identity mapping loss;
  • wherein the forward adversarial loss is obtained by the forward discriminator and measures, after the forward generator converts the source voice data into pseudo target voice data, the difference between the target voice data and the pseudo target voice data; the reverse adversarial loss is obtained by the reverse discriminator and measures, after the reverse generator converts the target voice data into pseudo source voice data, the difference between the source voice data and the pseudo source voice data; the forward cycle consistency loss measures, after the forward generator converts the source voice data into pseudo target voice data and the reverse generator converts the pseudo target voice data into cyclic source voice data, the difference between the cyclic source voice data and the source voice data; the reverse cycle consistency loss measures, after the reverse generator converts the target voice data into pseudo source voice data and the forward generator converts the pseudo source voice data into cyclic target voice data, the difference between the cyclic target voice data and the target voice data; the forward identity mapping loss measures, after the forward generator converts the target voice data into target identity voice data, the difference between the target voice data and the target identity voice data; and the reverse identity mapping loss measures, after the reverse generator converts the source voice data into source identity voice data, the difference between the source voice data and the source identity voice data.
  • the forward generator and the reverse generator have the same structure, and the forward discriminator and the reverse discriminator have the same structure.
  • the standardized processing of the voice data includes:
  • the voice data is converted into a sampling rate of a predetermined frequency and a predetermined format.
  • In one embodiment, the cyclic generative adversarial network model is obtained by training in the following manner:
  • the source voice data of the source speaker and the target voice data of the target speaker are acquired, the durations of the source voice data and the target voice data each exceeding a predetermined duration; the source voice data and the target voice data are standardized, and the spectrum envelopes of the standardized source voice data and target voice data are extracted; and the following training steps are performed iteratively until the training of the cyclic generative adversarial network model reaches a predetermined condition: using the spectrum envelopes of the source voice data and the target voice data, the Mel frequency cepstral coefficients corresponding to a second predetermined number of consecutive frames of the source voice data and of the target voice data are extracted, the Mel frequency cepstral coefficients having a first predetermined number of dimensions;
  • the Mel frequency cepstral coefficients of the source voice data and of the target voice data are respectively input to the cyclic generative adversarial network model, and after the outputs of the generators and discriminators of the model are computed, a loss function is calculated based on those outputs and the parameters of the model are updated based on the output of the loss function.
  • the generating module is further configured to:
  • the speech to be generated is synthesized by using the spectrum envelope, aperiodic signal parameters and the fundamental frequency of the speech to be generated as the speech of the target speaker or the source speaker opposite to the speaker to be dubbed.
  • In one embodiment, determining the fundamental frequency of the speech to be generated based on the fundamental frequency of the voice data, the average value and standard deviation of the fundamental frequency of the source voice data, and the average value and standard deviation of the fundamental frequency of the target voice data includes:
  • determining the fundamental frequency of the speech to be generated using the following formulas:
  • F0_normalized = (F0_x - F0_mean_x) / F0_std_x
  • F0_y-converted = F0_normalized × F0_std_y + F0_mean_y
  • where F0_x is the fundamental frequency of the voice data; F0_mean_x is the average fundamental frequency of the source voice data or target voice data corresponding to the speaker to be dubbed; F0_std_x is the standard deviation of the fundamental frequency of the source voice data or target voice data corresponding to the speaker to be dubbed; F0_mean_y is the average fundamental frequency of the target voice data or source voice data corresponding to the speaker to be dubbed; F0_std_y is the standard deviation of the fundamental frequency of the target voice data or source voice data corresponding to the speaker to be dubbed; F0_normalized is an intermediate result; and F0_y-converted is the fundamental frequency of the speech to be generated.
  • the computer equipment includes:
  • at least one processor;
  • a memory communicatively connected with the at least one processor; wherein
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the intelligent dubbing method shown in any of the foregoing exemplary embodiments.
  • the computer device 600 according to this embodiment of the present application will be described below with reference to FIG. 6.
  • the computer device 600 shown in FIG. 6 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the computer device 600 is represented in the form of a general-purpose computing device.
  • the components of the computer device 600 may include, but are not limited to: the aforementioned at least one processing unit 610, the aforementioned at least one storage unit 620, and a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610).
  • The storage unit stores program code, and the program code can be executed by the processing unit 610, so that the processing unit 610 executes the steps of the various exemplary embodiments described in the above-mentioned "Exemplary Method" section of this specification.
  • The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 621 and/or a cache storage unit 622, and may further include a read-only memory (ROM) unit 623.
  • The storage unit 620 may also include a program/utility tool 624 having a set of (at least one) program modules 625; such program modules 625 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, and each or some combination of these examples may include an implementation of a network environment.
  • The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • The computer device 600 can also communicate with one or more external devices 800 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the computer device 600, and/or with any device (such as a router, a modem, etc.) that enables the computer device 600 to communicate with one or more other computer devices. This communication can be performed through an input/output (I/O) interface 650.
  • The computer device 600 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 660.
  • The network adapter 660 communicates with other modules of the computer device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the computer device 600, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • The exemplary embodiments described herein can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to make a computer device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.
  • a computer-readable storage medium on which is stored a program product capable of realizing the above-mentioned method of this specification.
  • The computer-readable storage medium may be non-volatile or volatile.
  • Various aspects of the present application can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code is used to cause the terminal device to execute the steps according to the various exemplary embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
  • Referring to Fig. 7, a program product 700 for implementing the above method according to an embodiment of the present application is described; it can adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, for example a personal computer.
  • the program product of this application is not limited to this.
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of this application can be written in any combination of one or more programming languages.
  • The programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the user's computer device, partly on the user's device, as an independent software package, partly on the user's computer device and partly on a remote computer device, or entirely on a remote computer device or server.
  • In cases involving a remote computer device, the remote computer device can be connected to the user's computer device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer device (for example, via the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application relates to the field of artificial intelligence speech processing, and discloses an intelligent dubbing method, device, computer equipment, and storage medium. The method includes: acquiring voice data of a speaker to be dubbed; performing standardized processing on the voice data, and extracting the spectrum envelope, fundamental frequency, and aperiodic signal parameters of the standardized voice data; extracting first mel cepstrum frequency coefficients of the spectrum envelope; inputting the first mel cepstrum frequency coefficients to the forward generator or the reverse generator of a pre-trained cyclic generative adversarial network model to obtain second mel cepstrum frequency coefficients output by the forward generator or the reverse generator; and generating, based on the source voice data, the target voice data, the second mel cepstrum frequency coefficients, and the fundamental frequency and aperiodic signal parameters of the voice data, the voice of the target speaker or source speaker opposite to the speaker to be dubbed. This method enables voice conversion between people with different timbres, improving dubbing efficiency and reducing dubbing cost.

Description

智能配音方法、装置、计算机设备和存储介质
本申请要求于2020年5月26日提交中国专利局、申请号为CN 202010457088.5,发明名称为“智能配音方法、装置、介质及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能的语音处理技术领域,特别是涉及一种智能配音方法、装置、计算机设备和计算机可读存储介质。
背景技术
配音是影视娱乐领域的一项重要工作。发明人意识到,目前,为了完成某些配音任务,往往都要亲自去找有相应说话风格和音色的人来亲自进行配音,这种方式费时费力,效率很低。
发明内容
在人工智能的语音处理技术领域,为了解决上述技术问题,本申请的目的在于提供一种智能配音方法、装置、计算机设备和计算机可读存储介质。
第一方面,提供了一种智能配音方法,包括:
获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;
将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
第二方面,提供了一种智能配音装置,包括:
获取模块,被配置为获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
处理和提取模块,被配置为对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
提取模块,提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;
输入模块,被配置为将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络 模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
生成模块,被配置为基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
第三方面,提供了一种计算机设备,包括存储器和处理器,所述存储器用于存储所述处理器的智能配音的程序,所述处理器配置为经由执行所述智能配音的程序来执行以下处理:获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
第四方面,提供了一种存储有计算机可读指令的计算机可读存储介质,其上存储有智能配音的程序,所述智能配音的程序被处理器执行时实现以下处理:获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
上述智能配音方法、装置、计算机设备和计算机可读存储介质,通过先对待配音说话人的语音数据进行标准化处理,然后提取标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数,接着提取一梅尔倒谱频率系数,然后将第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器之后得到第二梅尔倒谱频率系数,最终基于第二梅尔倒谱频率系数、源语音数据、目标语音数据、基频及非周期信号参数生成语音。仅需获取要进行音色转换的双方的语音数据就能自动实现不同音色的人的语音转换,从而便于将一个人的语音转换为另一人的语音,不需要专门找相应音色的人来配音,提高了配音效率,降低了配音成本。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性的,并不能限制本申请。
附图说明
图1是根据一示例性实施例示出的一种智能配音方法应用的系统架构示意图;
图2是根据一示例性实施例示出的一种智能配音方法的流程图;
图3是根据一示例性实施例示出的应用本申请提供的智能配音方法时所使用的循环生成对抗网络模型的架构示意图;
图4A是根据一示例性实施例示出的循环生成对抗网络模型的循环一致性损失和对抗损失的原理示意图;
图4B是根据一示例性实施例示出的循环生成对抗网络模型的身份映射损失的原理示意图;
图5是根据一示例性实施例示出的一种智能配音装置的框图;
图6是根据一示例性实施例示出的一种实现上述智能配音方法的计算机设备的示例框图;
图7是根据一示例性实施例示出的一种实现上述智能配音方法的计算机可读存储介质。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
此外,附图仅为本申请的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。
本申请首先提供了一种智能配音方法。智能配音是指将第一人的语音转换为第二人的语音,转换前后的两人的语音的内容不变,但转换前后的两人的语音的音色分别属于第一人和第二人。本申请提供的智能配音方法应用了人工智能技术实现了智能配音,该方法可以应用于多个领域,比如,在金融领域的智能营销和智能运营中,可以将客服座席的声音都转换成较为柔美的音色,可以提升接到电话者的体验,从而提升产品的销售;在教育领域中,可以应用在有声读物和在线教育,将线上讲课的老师的声音转换为学生喜欢的老师的声音,可以激发学习兴趣;在影视领域,可以实现智能配音,例如拍摄已故伟人的纪录片,可以提取其在历史影像视频中的声音,由配音演员进行纪录片的配音后,再将音色转换成已故伟人的声音,会让纪录片更加有时代感和真实感。
本申请的实施终端可以是任何具有运算处理和通信功能的设备,该设备可以与外部设备相连,用于接收或者发送数据,具体可以是便携移动设备,例如智能手机、平板电脑、笔记本电脑、PDA(Personal Digital Assistant)等,也可以是固定式设备,例如,计算机设备、现场终端、台式电脑、服务器、工作站等,还可以是多个设备的集合,比如云计算的物理基础设施或者服务器集群。
可选地,本申请的实施终端可以为服务器或者云计算的物理基础设施。
图1是根据一示例性实施例示出的一种智能配音方法应用的系统架构示意图。如图1所示,该系统架构包括终端110和服务器120,终端110与服务器120通过通信链路相连,用来接收或发送数据。用户可以通过终端110录入语音数据,服务器120为本申请的实施终端,服务器110上设有预先训练好的循环生成对抗网络模型,用户通过终端110录入的语音数据上传至服务器120后即可利用服务器110上的循环生成对抗网络模型进行配音。
值得一提的是,图1仅为本申请的一个实施例。虽然在图1中,用户,即待配音说话人的语音数据在终端录入语音数据并经由终端上传至本申请的实施终端,但在其他实 施例或者具体应用中,用户录入语音数据的终端和本申请的实施终端可以为同一终端;虽然在图1实施例中除了终端110之外未包含其他与服务器120相连的终端,但在其他实施例中,还可以包括其他终端与服务器120相连,比如,可以包括为服务器120上循环生成对抗网络模型的训练提供数据的终端。
图2是根据一示例性实施例示出的一种智能配音方法的流程图。本实施例提供的智能配音方法可以由服务器执行,如图2所示,包括以下步骤:
步骤210,获取待配音说话人的语音数据。
所述待配音说话人为源说话人和目标说话人中的一位。
待配音说话人的语音数据可以是以各种格式的语音格式存在,比如可以是CD格式、WAV格式、MP3格式等。
源说话人和目标说话人是相对于模型的训练而言的,所述待配音说话人为源说话人和目标说话人中的一位,这意味着,无论是待配音说话人是源说话人还是目标说话人,通过本申请提供的智能配音方法都可以将待配音说话人转换为相对的一方的语音。
步骤220,对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数。
在一个实施例中,所述对所述语音数据进行标准化处理,包括:
将所述语音数据转换为预定频率的采样率和预定格式。
比如,可以将语音数据统一转换为16000khz采样率,单通道的wav格式。
频谱包络是将不同频率的振幅最高点连结起来形成的曲线,就叫频谱包络线。
在声音中,基频是指一个复音中基音的频率。非周期信号参数是用于反映语音的音色的一种参数。
可以通过多种方式提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数,比如可以使用WORLD工具包来提取语音的频谱包络、基频和非周期信号参数。
步骤230,提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数。
可以采用各种方式来提取语音数据的梅尔频率倒谱系数,比如可以通过使用WORLD工具包的CodeSpectralEnvelope方法来提取。
第一梅尔倒谱频率系数的维度数可以是事先根据人为经验或者规定而设定的,比如可以是39、48等。
步骤240,将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数。
其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成。
在一个实施例中,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器通过如下方式训练而成:
利用源说话人的源语音数据和目标说话人的目标语音数据并基于正向循环一致性损失、反向循环一致性损失、正向对抗损失和正向身份映射损失训练得到所述循环生成对抗网络模型的正向生成器;
利用源说话人的源语音数据和目标说话人的目标语音数据并基于反向循环一致性损失、正向循环一致性损失、反向对抗损失和反向身份映射损失训练得到所述循环生成对 抗网络的反向生成器,其中,所述正向对抗损失由所述正向鉴别器获得,用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据后,所述目标语音数据与所述伪目标语音数据之间的差异,所述反向对抗损失由所述反向鉴别器获得,用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据后,所述源语音数据与所述伪源语音数据之间的差异,所述正向循环一致性损失用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据,并由所述反向生成器将所述伪目标语音数据转换为循环源语音数据后,所述循环源语音数据与所述源语音数据之间的差异,所述反向循环一致性损失用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据,并由所述正向生成器将所述伪源语音数据转换为循环目标语音数据后,所述循环目标语音数据与所述目标语音数据之间的差异,所述正向身份映射损失用于衡量由所述正向生成器将所述目标语音数据转换为目标身份语音数据后,所述目标语音数据与所述目标身份语音数据之间的差异,所述反向身份映射损失用于衡量由所述反向生成器将源语音数据转换为源身份语音数据后,所述源语音数据与所述源身份语音数据之间的差异。
在一个实施例中,所述正向生成器与所述反向生成器的结构相同,所述正向鉴别器与所述反向鉴别器的结构相同。
在本实施例中,通过将正向生成器与反向生成器的结构设计为相同的结构,并将正向鉴别器与所述反向鉴别器的结构也设计为相同的结构,实现了对循环生成对抗网络模型的优化,从而提高了配音效果。
可以将预先训练好的循环生成对抗网络模型部署在区块链上,可以提高安全性,也可以便于模型的应用。
在一个实施例中,所述正向生成器和所述反向生成器分别包括:二维的第一卷积单元,与所述第一卷积单元相连的一维的第二卷积单元、与所述第二卷积单元相连的二维的第三卷积单元,其中,每一卷积单元的输出部分包括门控线性单元。
通过在生成器的前后部分设置二维的卷积单元,可以使生成器具有更广泛地捕获特征的能力,而通过在生成器的中间部分设置一维的卷积单元,可以更好地处理序列的语音数据。
门控线性单元(GLU,Gated Linear Units)可以用于避免训练过程中的梯度丧失。每一卷积单元的输出部分包括门控线性单元即每一卷积单元的激活函数为门控线性单元。
在一个实施例中,所述正向鉴别器和反向鉴别器的输出层均为二维的卷积层。
在正向鉴别器和反向鉴别器的输出层均为二维的卷积层的情况下,鉴别器输出的判定真伪的结果不再是整个语音信号的True或False的判定,而是输出一个n×n的矩阵,这个矩阵的每个元素都是一个判定结果(True,False),其代表着语音信号的一个子集。这样的鉴别器在对转换语音细节的判定上更加准确,能够让输出的转换结果的语音听起来更加清晰、逼真。
参见图3所示,示例性地示出了应用本申请提供的智能配音方法时所使用的循环生成对抗网络模型的架构示意图,该循环生成对抗网络模型包括正向生成器320、反向生成器320′、正向鉴别器340和反向鉴别器340′。对于模型的第一部分,源语音数据310输入至正向生成器320转换为伪目标语音数据330,伪目标语音数据330和目标语音数据350会被输入至正向鉴别器340进行判断,以计算正向对抗损失。对于模型的第二部分,目标语音数据350输入至反向生成器320′转换为伪源语音数据330′,伪源语音数据330和源语音数据310会被输入至反向鉴别器340′进行判断,以计算反向对抗损失。另外,伪目标语音数据330会被送入至反向生成器320′,伪源语音数据330′也会被送入至正向生成器320,用于计算循环一致性损失;目标语音数据350还会被输入至正向生成器320,源语音数据310还会被输入至反向生成器320′,用于计算身份映射损失。
在一个实施例中,所述循环生成对抗网络模型通过如下的方式训练得到:
分别获取源说话人的源语音数据和目标说话人的目标语音数据,所述源语音数据和所述目标语音数据的时长分别超过预定时长;
分别对所述源语音数据和所述目标语音数据进行标准化处理,并提取经标准化处理后的所述源语音数据和所述目标语音数据的频谱包络;
迭代执行下列训练步骤,直至对所述循环生成对抗网络模型的训练达到预定条件:
利用所述源语音数据和所述目标语音数据的频谱包络,分别提取所述源语音数据和所述目标语音数据的连续第二预定数目帧语音数据所对应的梅尔频率倒谱系数,其中,所述梅尔频率倒谱系数为第一预定数目维;
分别将所述源语音数据和所述目标语音数据的所述梅尔频率倒谱系数输入至所述循环生成对抗网络模型,并在计算出所述循环生成对抗网络模型的各生成器和鉴别器的输出后,基于所述输出计算损失函数并基于所述损失函数的输出结果更新所述循环生成对抗网络模型的参数。
预定时长是根据经验设定的可以对循环生成对抗网络模型实现很好的训练效果的时长,比如可以是10分钟或15分钟。
预定条件是停止对所述循环生成对抗网络模型的训练的条件,预定条件比如可以是迭代执行训练步骤的次数达到预定次数阈值、迭代执行训练步骤的时长达到预定时间长度阈值、所述损失函数的输出结果小于预定结果阈值等。
可以采用各种方式来提取语音数据的频谱包络,比如可以通过使用WORLD工具包进行提取。
更新所述循环生成对抗网络模型的参数的方式可以利用反向传播算法。
在声音处理领域中,梅尔频率倒谱(Mel-Frequency Cepstrum)是基于声音频率的非线性梅尔刻度(mel scale)的对数能量频谱的线性变换。
梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCCs)就是组成梅尔频率倒谱的系数。它衍生自音讯片段的倒频谱(cepstrum)。倒谱和梅尔频率倒谱的区别在于,梅尔频率倒谱的频带划分是在梅尔刻度上等距划分的,它比用于正常的对数倒频谱中的线性间隔的频带更能近似人类的听觉系统。这样的非线性表示,可以在多个领域中使声音信号有更好的表示。例如在音讯压缩中。
连续帧语音数据的数目——第二预定数目以及梅尔频率倒谱系数的维度数——第一预定数目,都是预先根据经验等因素来设定的,比如,连续帧语音数据的数目可以为128,而梅尔频率倒谱系数的维度数可以为32、39、48等。
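作为示意，训练时可以从整段语音的梅尔频率倒谱系数序列中随机截取连续的第二预定数目帧（此处假设为128帧，并假设整段语音至少包含128帧）作为一个训练样本：

```python
import numpy as np

def sample_frames(coded_sp, n_frames=128):
    # coded_sp: [总帧数, 维度数] 的梅尔频率倒谱系数序列
    start = np.random.randint(0, coded_sp.shape[0] - n_frames + 1)
    return coded_sp[start:start + n_frames]
```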
可以采用各种方式来提取语音数据的梅尔频率倒谱系数,比如可以通过使用WORLD语音工具包的CodeSpectralEnvelope方法来提取。
参见图4A所示，示出了所述循环生成对抗网络模型的循环一致性损失和对抗损失的原理示意图，左侧为正向学习的过程，右侧为反向学习的过程。参见图4A左侧，假设x_real为源语音数据，G_X→Y为正向生成器，通过G_X→Y将源语音数据转换为伪目标语音数据y_fake，此时通过正向鉴别器D_Y来计算目标语音数据与所述伪目标语音数据之间的差异，可以得到正向对抗损失，接着，通过G_Y→X这一反向生成器将伪目标语音数据y_fake转换为循环源语音数据x_cycle，可以计算循环源语音数据x_cycle与源语音数据x_real之间的差异，从而得到正向循环一致性损失。同理，通过图4A右侧示出的过程，可以得到反向对抗损失和反向循环一致性损失。
参见图4B所示,示出了所述循环生成对抗网络模型的身份映射损失的原理示意图,左侧为正向映射过程,右侧为反向映射过程。参见左侧的正向映射过程,先将目标语音数据输入至正向生成器,在由正向生成器将目标语音数据转换为目标身份语音数据后,通过计算所述目标语音数据与所述目标身份语音数据之间的差异,可以得到正向身份映射损失;同理,通过将源语音数据输入至反向生成器,由反向生成器将源语音数据转换 为源身份语音数据,可以计算得到反向身份映射损失。
身份语音数据是将语音数据输入至生成器后由生成器输出得到的。通过基于语音数据与身份语音数据之间的差异计算损失函数并使该差异最小化，可以最大程度地保证生成器在对语音进行转换时保留原语音中的语言内容结构。
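结合图4A与图4B的说明，下面给出计算对抗损失、循环一致性损失与身份映射损失的一个示意性代码草图（基于PyTorch；此处采用最小二乘形式的对抗损失和L1范数度量差异，损失权重lambda_cyc、lambda_id等均为假设，并非本申请限定的实现方式）：

```python
import torch
import torch.nn.functional as F

def generator_losses(G_XY, G_YX, D_X, D_Y, x_real, y_real,
                     lambda_cyc=10.0, lambda_id=5.0):
    # 对抗损失：希望鉴别器把生成的伪语音判为真
    y_fake = G_XY(x_real)                      # 源 -> 伪目标
    x_fake = G_YX(y_real)                      # 目标 -> 伪源
    d_y_fake = D_Y(y_fake)
    d_x_fake = D_X(x_fake)
    adv_fwd = F.mse_loss(d_y_fake, torch.ones_like(d_y_fake))   # 正向对抗损失
    adv_bwd = F.mse_loss(d_x_fake, torch.ones_like(d_x_fake))   # 反向对抗损失

    # 循环一致性损失：源 -> 伪目标 -> 循环源 应尽量还原源语音特征，反向同理
    x_cycle = G_YX(y_fake)
    y_cycle = G_XY(x_fake)
    cyc = F.l1_loss(x_cycle, x_real) + F.l1_loss(y_cycle, y_real)

    # 身份映射损失：目标语音经正向生成器、源语音经反向生成器后应基本保持不变
    idt = F.l1_loss(G_XY(y_real), y_real) + F.l1_loss(G_YX(x_real), x_real)

    return adv_fwd + adv_bwd + lambda_cyc * cyc + lambda_id * idt
```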
步骤250,基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
在一个实施例中,步骤250可以包括:
根据所述源说话人的源语音数据和目标说话人的目标语音数据分别确定所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差;
利用所述第二梅尔倒谱频率系数恢复要生成的语音的频谱包络;
根据所述语音数据的非周期信号参数确定要生成的语音的非周期信号参数;
基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频;
利用所述要生成的语音的频谱包络、非周期信号参数和基频合成要生成的语音,作为与所述待配音说话人相对的目标说话人或源说话人的语音。
可以采用多种方式利用所述要生成的语音的频谱包络、非周期信号参数和基频合成要生成的语音,比如可以使用WORLD工具包的语音合成方法来合成语音文件。
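沿用前文关于pyworld的假设，下面的代码草图示意了如何由转换后的梅尔倒谱频率系数恢复频谱包络，并结合非周期信号参数和基频合成要生成的语音（采样率、帧移等参数为假设值）：

```python
import numpy as np
import pyworld
import soundfile as sf

def synthesize_converted(coded_sp_converted, f0_converted, ap, out_path,
                         fs=16000, frame_period=5.0):
    fft_size = pyworld.get_cheaptrick_fft_size(fs)
    # 由第二梅尔倒谱频率系数恢复要生成的语音的频谱包络
    sp = pyworld.decode_spectral_envelope(
        np.ascontiguousarray(coded_sp_converted, dtype=np.float64), fs, fft_size)
    # 利用频谱包络、非周期信号参数和基频合成波形并写出wav文件
    wav = pyworld.synthesize(f0_converted.astype(np.float64), sp,
                             ap.astype(np.float64), fs, frame_period)
    sf.write(out_path, wav, fs)
    return out_path
```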
在一个实施例中,所述根据所述语音数据的非周期信号参数确定要生成的语音的非周期信号参数,包括:
将所述语音数据的非周期信号参数作为要生成的语音的非周期信号参数。
比如，若所述语音数据的非周期信号参数为AP_x，而要生成的语音的非周期信号参数为AP_y-converted，那么通过令AP_y-converted=AP_x进行赋值，即可将所述语音数据的非周期信号参数作为要生成的语音的非周期信号参数。
在一个实施例中,所述基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频,包括:
基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,利用如下公式确定要生成的语音的基频:
F0_normalized=(F0_x-F0_mean_x)/F0_std_x,
F0_y-converted=F0_normalized×F0_std_y+F0_mean_y,
其中，F0_x为所述语音数据的基频，F0_mean_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的平均值，F0_std_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的标准差，F0_mean_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的平均值，F0_std_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的标准差，F0_normalized为中间结果，F0_y-converted为要生成的语音的基频。
由于所述待配音说话人可以为源说话人或目标说话人，上述公式可以同时应用于待配音说话人为源说话人、目标说话人的任意一种场景。一方面，当所述待配音说话人为源说话人时，F0_mean_x为源语音数据基频的平均值，F0_std_x为源语音数据基频的标准差，F0_mean_y为目标语音数据基频的平均值，F0_std_y为目标语音数据基频的标准差；另一方面，当所述待配音说话人为目标说话人时，F0_mean_x为目标语音数据基频的平均值，F0_std_x为目标语音数据基频的标准差，F0_mean_y为源语音数据基频的平均值，F0_std_y为源语音数据基频的标准差。
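上述基频转换可以用如下示意性代码实现（其中仅对有声帧即F0大于0的帧进行转换、无声帧保持为0这一处理方式属于常见做法，为说明性假设）：

```python
import numpy as np

def convert_f0(f0_x, mean_x, std_x, mean_y, std_y):
    # 按上述公式：先用待配音一方的均值/标准差归一化，再用相对一方的均值/标准差反归一化
    f0_converted = np.zeros_like(f0_x)
    voiced = f0_x > 0
    f0_normalized = (f0_x[voiced] - mean_x) / std_x
    f0_converted[voiced] = f0_normalized * std_y + mean_y
    return f0_converted
```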
本申请还提供了一种智能配音装置,以下是本申请的装置实施例。
图5是根据一示例性实施例示出的一种智能配音装置的框图。如图5所示,装置500包括:
获取模块510,被配置为获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
处理和提取模块520,被配置为对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
提取模块530，被配置为提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数；
输入模块540,被配置为将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
生成模块550,被配置为基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
在一个实施例中,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器通过如下方式训练而成:
利用源说话人的源语音数据和目标说话人的目标语音数据并基于正向循环一致性损失、反向循环一致性损失、正向对抗损失和正向身份映射损失训练得到所述循环生成对抗网络模型的正向生成器;
利用源说话人的源语音数据和目标说话人的目标语音数据并基于反向循环一致性损失、正向循环一致性损失、反向对抗损失和反向身份映射损失训练得到所述循环生成对抗网络的反向生成器,其中,所述正向对抗损失由所述正向鉴别器获得,用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据后,所述目标语音数据与所述伪目标语音数据之间的差异,所述反向对抗损失由所述反向鉴别器获得,用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据后,所述源语音数据与所述伪源语音数据之间的差异,所述正向循环一致性损失用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据,并由所述反向生成器将所述伪目标语音数据转换为循环源语音数据后,所述循环源语音数据与所述源语音数据之间的差异,所述反向循环一致性损失用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据,并由所述正向生成器将所述伪源语音数据转换为循环目标语音数据后,所述循环目标语音数据与所述目标语音数据之间的差异,所述正向身份映射损失用于衡量由所述正向生成器将所述目标语音数据转换为目标身份语音数据后,所述目标语音数据与所述目标身份语音数据之间的差异,所述反向身份映射损失用于衡量由所述反向生成器将源语音数据转换为源身份语音数据后,所述源语音数据与所述源身份语音数据之间的差异。
在一个实施例中,所述正向生成器与所述反向生成器的结构相同,所述正向鉴别器与所述反向鉴别器的结构相同。
在一个实施例中,所述对所述语音数据进行标准化处理,包括:
将所述语音数据转换为预定频率的采样率和预定格式。
在一个实施例中,所述循环生成对抗网络模型通过如下的方式训练得到:
分别获取源说话人的源语音数据和目标说话人的目标语音数据,所述源语音数据和 所述目标语音数据的时长分别超过预定时长;
分别对所述源语音数据和所述目标语音数据进行标准化处理,并提取经标准化处理后的所述源语音数据和所述目标语音数据的频谱包络;
迭代执行下列训练步骤,直至对所述循环生成对抗网络模型的训练达到预定条件:
利用所述源语音数据和所述目标语音数据的频谱包络,分别提取所述源语音数据和所述目标语音数据的连续第二预定数目帧语音数据所对应的梅尔频率倒谱系数,其中,所述梅尔频率倒谱系数为第一预定数目维;
分别将所述源语音数据和所述目标语音数据的所述梅尔频率倒谱系数输入至所述循环生成对抗网络模型,并在计算出所述循环生成对抗网络模型的各生成器和鉴别器的输出后,基于所述输出计算损失函数并基于所述损失函数的输出结果更新所述循环生成对抗网络模型的参数。
在一个实施例中,所述生成模块被进一步配置为:
根据所述源说话人的源语音数据和目标说话人的目标语音数据分别确定所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差;
利用所述第二梅尔倒谱频率系数恢复要生成的语音的频谱包络;
根据所述语音数据的非周期信号参数确定要生成的语音的非周期信号参数;
基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频;
利用所述要生成的语音的频谱包络、非周期信号参数和基频合成要生成的语音,作为与所述待配音说话人相对的目标说话人或源说话人的语音。
在一个实施例中,所述基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频,包括:
基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,利用如下公式确定要生成的语音的基频:
F0_normalized=(F0_x-F0_mean_x)/F0_std_x,
F0_y-converted=F0_normalized×F0_std_y+F0_mean_y,
其中，F0_x为所述语音数据的基频，F0_mean_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的平均值，F0_std_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的标准差，F0_mean_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的平均值，F0_std_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的标准差，F0_normalized为中间结果，F0_y-converted为要生成的语音的基频。
根据本申请的第三方面,还提供了一种计算机设备,执行上述任一所示的智能配音方法的全部或者部分步骤。该计算机设备包括:
至少一个处理器;以及
与所述至少一个处理器通信连接的存储器;其中,
所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如上述任一个示例性实施例所示出的智能配音方法。
所属技术领域的技术人员能够理解,本申请的各个方面可以实现为系统、方法或程序产品。因此,本申请的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。
下面参照图6来描述根据本申请的这种实施方式的计算机设备600。图6显示的计算机设备600仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图6所示,计算机设备600以通用计算设备的形式表现。计算机设备600的组件可以包括但不限于:上述至少一个处理单元610、上述至少一个存储单元620、连接不同系统组件(包括存储单元620和处理单元610)的总线630。
其中,所述存储单元存储有程序代码,所述程序代码可以被所述处理单元610执行,使得所述处理单元610执行本说明书上述“实施例方法”部分中描述的根据本申请各种示例性实施方式的步骤。
存储单元620可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)621和/或高速缓存存储单元622,还可以进一步包括只读存储单元(ROM)623。
存储单元620还可以包括具有一组(至少一个)程序模块625的程序/实用工具624,这样的程序模块625包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。
总线630可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。
计算机设备600也可以与一个或多个外部设备800(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该计算机设备600交互的设备通信,和/或与使得该计算机设备600能与一个或多个其它计算机设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口650进行。并且,计算机设备600还可以通过网络适配器660与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器660通过总线630与计算机设备600的其它模块通信。应当明白,尽管图中未示出,可以结合计算机设备600使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算机设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本申请实施方式的方法。
根据本申请的第四方面，还提供了一种计算机可读存储介质，其上存储有能够实现本说明书上述方法的程序产品，所述计算机可读存储介质可以是非易失性的，也可以是易失性的。在一些可能的实施方式中，本申请的各个方面还可以实现为一种程序产品的形式，其包括程序代码，当所述程序产品在终端设备上运行时，所述程序代码用于使所述终端设备执行本说明书上述“示例性方法”部分中描述的根据本申请各种示例性实施方式的步骤。
参考图7所示,描述了根据本申请的实施方式的用于实现上述方法的程序产品700,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本申请的程序产品不限于此,在本文件中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
所述程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存 储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言的任意组合来编写用于执行本申请操作的程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算机设备上部分在远程计算机设备上执行、或者完全在远程计算机设备或服务器上执行。在涉及远程计算机设备的情形中,远程计算机设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算机设备,或者,可以连接到外部计算机设备(例如利用因特网服务提供商来通过因特网连接)。
此外,上述附图仅是根据本申请示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围执行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (20)

  1. 一种智能配音方法,包括:
    获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
    对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
    提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;
    将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
    基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
  2. 根据权利要求1所述的方法,其中,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器通过如下方式训练而成:
    利用源说话人的源语音数据和目标说话人的目标语音数据并基于正向循环一致性损失、反向循环一致性损失、正向对抗损失和正向身份映射损失训练得到所述循环生成对抗网络模型的正向生成器;
    利用源说话人的源语音数据和目标说话人的目标语音数据并基于反向循环一致性损失、正向循环一致性损失、反向对抗损失和反向身份映射损失训练得到所述循环生成对抗网络的反向生成器,其中,所述正向对抗损失由所述正向鉴别器获得,用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据后,所述目标语音数据与所述伪目标语音数据之间的差异,所述反向对抗损失由所述反向鉴别器获得,用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据后,所述源语音数据与所述伪源语音数据之间的差异,所述正向循环一致性损失用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据,并由所述反向生成器将所述伪目标语音数据转换为循环源语音数据后,所述循环源语音数据与所述源语音数据之间的差异,所述反向循环一致性损失用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据,并由所述正向生成器将所述伪源语音数据转换为循环目标语音数据后,所述循环目标语音数据与所述目标语音数据之间的差异,所述正向身份映射损失用于衡量由所述正向生成器将所述目标语音数据转换为目标身份语音数据后,所述目标语音数据与所述目标身份语音数据之间的差异,所述反向身份映射损失用于衡量由所述反向生成器将源语音数据转换为源身份语音数据后,所述源语音数据与所述源身份语音数据之间的差异。
  3. 根据权利要求2所述的方法,其中,所述正向生成器与所述反向生成器的结构相同,所述正向鉴别器与所述反向鉴别器的结构相同。
  4. 根据权利要求1所述的方法,其中,所述对所述语音数据进行标准化处理,包括:
    将所述语音数据转换为预定频率的采样率和预定格式。
  5. 根据权利要求2或3所述的方法,其中,所述循环生成对抗网络模型通过如下的方 式训练得到:
    分别获取源说话人的源语音数据和目标说话人的目标语音数据,所述源语音数据和所述目标语音数据的时长分别超过预定时长;
    分别对所述源语音数据和所述目标语音数据进行标准化处理,并提取经标准化处理后的所述源语音数据和所述目标语音数据的频谱包络;
    迭代执行下列训练步骤,直至对所述循环生成对抗网络模型的训练达到预定条件:
    利用所述源语音数据和所述目标语音数据的频谱包络,分别提取所述源语音数据和所述目标语音数据的连续第二预定数目帧语音数据所对应的梅尔频率倒谱系数,其中,所述梅尔频率倒谱系数为第一预定数目维;
    分别将所述源语音数据和所述目标语音数据的所述梅尔频率倒谱系数输入至所述循环生成对抗网络模型,并在计算出所述循环生成对抗网络模型的各生成器和鉴别器的输出后,基于所述输出计算损失函数并基于所述损失函数的输出结果更新所述循环生成对抗网络模型的参数。
  6. 根据权利要求2或3所述的方法,其中,所述基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音,包括:
    根据所述源说话人的源语音数据和目标说话人的目标语音数据分别确定所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差;
    利用所述第二梅尔倒谱频率系数恢复要生成的语音的频谱包络;
    根据所述语音数据的非周期信号参数确定要生成的语音的非周期信号参数;
    基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频;
    利用所述要生成的语音的频谱包络、非周期信号参数和基频合成要生成的语音,作为与所述待配音说话人相对的目标说话人或源说话人的语音。
  7. 根据权利要求6所述的方法,其中,所述基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频,包括:
    基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,利用如下公式确定要生成的语音的基频:
    F0_normalized=(F0_x-F0_mean_x)/F0_std_x,
    F0_y-converted=F0_normalized×F0_std_y+F0_mean_y,
    其中，F0_x为所述语音数据的基频，F0_mean_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的平均值，F0_std_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的标准差，F0_mean_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的平均值，F0_std_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的标准差，F0_normalized为中间结果，F0_y-converted为要生成的语音的基频。
  8. 一种智能配音装置,包括:
    获取模块,被配置为获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
    处理和提取模块,被配置为对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
    提取模块，被配置为提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数；
    输入模块,被配置为将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数 目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
    生成模块,被配置为基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行:
    获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
    对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
    提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;
    将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
    基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
  10. 根据权利要求9所述的计算机设备,其中,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器通过如下方式训练而成:
    利用源说话人的源语音数据和目标说话人的目标语音数据并基于正向循环一致性损失、反向循环一致性损失、正向对抗损失和正向身份映射损失训练得到所述循环生成对抗网络模型的正向生成器;
    利用源说话人的源语音数据和目标说话人的目标语音数据并基于反向循环一致性损失、正向循环一致性损失、反向对抗损失和反向身份映射损失训练得到所述循环生成对抗网络的反向生成器,其中,所述正向对抗损失由所述正向鉴别器获得,用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据后,所述目标语音数据与所述伪目标语音数据之间的差异,所述反向对抗损失由所述反向鉴别器获得,用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据后,所述源语音数据与所述伪源语音数据之间的差异,所述正向循环一致性损失用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据,并由所述反向生成器将所述伪目标语音数据转换为循环源语音数据后,所述循环源语音数据与所述源语音数据之间的差异,所述反向循环一致性损失用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据,并由所述正向生成器将所述伪源语音数据转换为循环目标语音数据后,所述循环目标语音数据与所述目标语音数据之间的差异,所述正向身份映射损失用于衡量由所述正向生成器将所述目标语音数据转换为目标身份语音数据后,所述目标语音数据与所述目标身份语音数据之间的差异,所述反向身份映射损失用于衡量由所述反向生 成器将源语音数据转换为源身份语音数据后,所述源语音数据与所述源身份语音数据之间的差异。
  11. 根据权利要求10所述的计算机设备,其中,所述正向生成器与所述反向生成器的结构相同,所述正向鉴别器与所述反向鉴别器的结构相同。
  12. 根据权利要求9所述的计算机设备,其中,所述对所述语音数据进行标准化处理,包括:
    将所述语音数据转换为预定频率的采样率和预定格式。
  13. 根据权利要求10或11所述的计算机设备,其中,所述循环生成对抗网络模型通过如下的方式训练得到:
    分别获取源说话人的源语音数据和目标说话人的目标语音数据,所述源语音数据和所述目标语音数据的时长分别超过预定时长;
    分别对所述源语音数据和所述目标语音数据进行标准化处理,并提取经标准化处理后的所述源语音数据和所述目标语音数据的频谱包络;
    迭代执行下列训练步骤,直至对所述循环生成对抗网络模型的训练达到预定条件:
    利用所述源语音数据和所述目标语音数据的频谱包络,分别提取所述源语音数据和所述目标语音数据的连续第二预定数目帧语音数据所对应的梅尔频率倒谱系数,其中,所述梅尔频率倒谱系数为第一预定数目维;
    分别将所述源语音数据和所述目标语音数据的所述梅尔频率倒谱系数输入至所述循环生成对抗网络模型,并在计算出所述循环生成对抗网络模型的各生成器和鉴别器的输出后,基于所述输出计算损失函数并基于所述损失函数的输出结果更新所述循环生成对抗网络模型的参数。
  14. 根据权利要求10或11所述的计算机设备,其中,所述基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音,包括:
    根据所述源说话人的源语音数据和目标说话人的目标语音数据分别确定所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差;
    利用所述第二梅尔倒谱频率系数恢复要生成的语音的频谱包络;
    根据所述语音数据的非周期信号参数确定要生成的语音的非周期信号参数;
    基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频;
    利用所述要生成的语音的频谱包络、非周期信号参数和基频合成要生成的语音,作为与所述待配音说话人相对的目标说话人或源说话人的语音。
  15. 根据权利要求14所述的计算机设备,其中,所述基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,确定要生成的语音的基频,包括:
    基于所述语音数据的基频、所述源语音数据基频的平均值和标准差以及目标语音数据基频的平均值和标准差,利用如下公式确定要生成的语音的基频:
    F0_normalized=(F0_x-F0_mean_x)/F0_std_x,
    F0_y-converted=F0_normalized×F0_std_y+F0_mean_y,
    其中，F0_x为所述语音数据的基频，F0_mean_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的平均值，F0_std_x为所述待配音说话人所对应的源语音数据或目标语音数据基频的标准差，F0_mean_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的平均值，F0_std_y为所述待配音说话人所对应的目标语音数据或源语音数据基频的标准差，F0_normalized为中间结果，F0_y-converted为要生成的语音的基频。
  16. 一种存储有计算机可读指令的计算机可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行:
    获取待配音说话人的语音数据,所述待配音说话人为源说话人和目标说话人中的一位;
    对所述语音数据进行标准化处理,并提取经标准化处理后的所述语音数据的频谱包络、基频和非周期信号参数;
    提取所述频谱包络的第一预定数目维的第一梅尔倒谱频率系数;
    将所述第一梅尔倒谱频率系数输入至预先训练好的循环生成对抗网络模型的正向生成器或反向生成器,得到由所述正向生成器或反向生成器输出的第一预定数目维的第二梅尔倒谱频率系数,其中,在所述待配音说话人为源说话人时,将所述第一梅尔倒谱频率系数输入至正向生成器,在所述待配音说话人为目标说话人时,将所述第一梅尔倒谱频率系数输入至反向生成器,所述循环生成对抗网络模型包括正向生成器、反向生成器、正向鉴别器和反向鉴别器,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器训练而成;
    基于所述源说话人的源语音数据、目标说话人的目标语音数据、所述第二梅尔倒谱频率系数、所述语音数据的基频和非周期信号参数生成与所述待配音说话人相对的目标说话人或源说话人的语音。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述预先训练好的循环生成对抗网络模型的正向生成器和反向生成器利用源说话人的源语音数据和目标说话人的目标语音数据并基于所述循环生成对抗网络模型的正向鉴别器和反向鉴别器通过如下方式训练而成:
    利用源说话人的源语音数据和目标说话人的目标语音数据并基于正向循环一致性损失、反向循环一致性损失、正向对抗损失和正向身份映射损失训练得到所述循环生成对抗网络模型的正向生成器;
    利用源说话人的源语音数据和目标说话人的目标语音数据并基于反向循环一致性损失、正向循环一致性损失、反向对抗损失和反向身份映射损失训练得到所述循环生成对抗网络的反向生成器,其中,所述正向对抗损失由所述正向鉴别器获得,用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据后,所述目标语音数据与所述伪目标语音数据之间的差异,所述反向对抗损失由所述反向鉴别器获得,用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据后,所述源语音数据与所述伪源语音数据之间的差异,所述正向循环一致性损失用于衡量由所述正向生成器将所述源语音数据转换为伪目标语音数据,并由所述反向生成器将所述伪目标语音数据转换为循环源语音数据后,所述循环源语音数据与所述源语音数据之间的差异,所述反向循环一致性损失用于衡量由所述反向生成器将所述目标语音数据转换为伪源语音数据,并由所述正向生成器将所述伪源语音数据转换为循环目标语音数据后,所述循环目标语音数据与所述目标语音数据之间的差异,所述正向身份映射损失用于衡量由所述正向生成器将所述目标语音数据转换为目标身份语音数据后,所述目标语音数据与所述目标身份语音数据之间的差异,所述反向身份映射损失用于衡量由所述反向生成器将源语音数据转换为源身份语音数据后,所述源语音数据与所述源身份语音数据之间的差异。
  18. 根据权利要求17所述的计算机可读存储介质,其中,所述正向生成器与所述反向生成器的结构相同,所述正向鉴别器与所述反向鉴别器的结构相同。
  19. 根据权利要求16所述的计算机可读存储介质,其中,所述对所述语音数据进行标准化处理,包括:
    将所述语音数据转换为预定频率的采样率和预定格式。
  20. 根据权利要求17或18所述的计算机可读存储介质,其中,所述循环生成对抗网络模型通过如下的方式训练得到:
    分别获取源说话人的源语音数据和目标说话人的目标语音数据,所述源语音数据和所述目标语音数据的时长分别超过预定时长;
    分别对所述源语音数据和所述目标语音数据进行标准化处理,并提取经标准化处理后的所述源语音数据和所述目标语音数据的频谱包络;
    迭代执行下列训练步骤,直至对所述循环生成对抗网络模型的训练达到预定条件:
    利用所述源语音数据和所述目标语音数据的频谱包络,分别提取所述源语音数据和所述目标语音数据的连续第二预定数目帧语音数据所对应的梅尔频率倒谱系数,其中,所述梅尔频率倒谱系数为第一预定数目维;
    分别将所述源语音数据和所述目标语音数据的所述梅尔频率倒谱系数输入至所述循环生成对抗网络模型,并在计算出所述循环生成对抗网络模型的各生成器和鉴别器的输出后,基于所述输出计算损失函数并基于所述损失函数的输出结果更新所述循环生成对抗网络模型的参数。
PCT/CN2020/105266 2020-05-26 2020-07-28 智能配音方法、装置、计算机设备和存储介质 WO2021237923A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010457088.5A CN111696520A (zh) 2020-05-26 2020-05-26 智能配音方法、装置、介质及电子设备
CN202010457088.5 2020-05-26

Publications (1)

Publication Number Publication Date
WO2021237923A1 true WO2021237923A1 (zh) 2021-12-02

Family

ID=72478291

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105266 WO2021237923A1 (zh) 2020-05-26 2020-07-28 智能配音方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN111696520A (zh)
WO (1) WO2021237923A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466314A (zh) * 2020-11-27 2021-03-09 平安科技(深圳)有限公司 情感语音数据转换方法、装置、计算机设备及存储介质
CN113421576B (zh) * 2021-06-29 2024-05-24 平安科技(深圳)有限公司 语音转换方法、装置、设备以及存储介质
CN114283825A (zh) * 2021-12-24 2022-04-05 北京达佳互联信息技术有限公司 一种语音处理方法、装置、电子设备及存储介质
CN115132204B (zh) * 2022-06-10 2024-03-22 腾讯科技(深圳)有限公司 一种语音处理方法、设备、存储介质及计算机程序产品
CN115064177A (zh) * 2022-06-14 2022-09-16 中国第一汽车股份有限公司 基于声纹编码器的语音转换方法、装置、设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190013037A1 (en) * 2017-01-23 2019-01-10 Dsp Group Ltd. Interface to leaky spiking neurons
CN110246504A (zh) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 鸟类声音识别方法、装置、计算机设备和存储介质
CN110459232A (zh) * 2019-07-24 2019-11-15 浙江工业大学 一种基于循环生成对抗网络的语音转换方法
CN110633698A (zh) * 2019-09-30 2019-12-31 上海依图网络科技有限公司 基于循环生成对抗网络的红外图片识别方法、设备及介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683680B (zh) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 说话人识别方法及装置、计算机设备及计算机可读介质
CN109377978B (zh) * 2018-11-12 2021-01-26 南京邮电大学 非平行文本条件下基于i向量的多对多说话人转换方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190013037A1 (en) * 2017-01-23 2019-01-10 Dsp Group Ltd. Interface to leaky spiking neurons
CN110246504A (zh) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 鸟类声音识别方法、装置、计算机设备和存储介质
CN110459232A (zh) * 2019-07-24 2019-11-15 浙江工业大学 一种基于循环生成对抗网络的语音转换方法
CN110633698A (zh) * 2019-09-30 2019-12-31 上海依图网络科技有限公司 基于循环生成对抗网络的红外图片识别方法、设备及介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, TAO: "Voice Conversion Based on CycleGAN Network under Non-Parallel Corpus", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA MASTER’S THESES FULL-TEXT DATABASE, no. 02, 15 February 2019 (2019-02-15), pages 1 - 52, XP009532120, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN111696520A (zh) 2020-09-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20937523

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20937523

Country of ref document: EP

Kind code of ref document: A1