CN111276119B - Speech generation method, system and computer equipment - Google Patents

Speech generation method, system and computer equipment Download PDF

Info

Publication number
CN111276119B
Authority
CN
China
Prior art keywords
spectrogram
voice
user
attribute
edited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010052356.5A
Other languages
Chinese (zh)
Other versions
CN111276119A (en)
Inventor
马坤
赵之砚
施奕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010052356.5A priority Critical patent/CN111276119B/en
Publication of CN111276119A publication Critical patent/CN111276119A/en
Application granted granted Critical
Publication of CN111276119B publication Critical patent/CN111276119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a voice generation method comprising the following steps: acquiring user audio data and converting the user audio data into a user voice spectrogram; extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes; acquiring audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited; generating a target voice spectrogram according to the user voice attributes and the voice spectrogram to be edited; and generating a voice signal for output according to the target voice spectrogram. Embodiments of the invention can thus output speech with a designated voice style attribute.

Description

Speech generation method, system and computer equipment
Technical Field
Embodiments of the present invention relate to the field of speech synthesis, and in particular to a speech generation method, system, computer device, and computer-readable storage medium.
Background
Speech synthesis is an important capability in the field of artificial intelligence: more natural and emotionally expressive synthetic speech can greatly improve the user's service experience and reflects the state of the art of artificial intelligence. In practical applications, however, the synthesized speech usually keeps a fixed style throughout the interaction with the user, which makes for a poor user experience, because most current speech synthesis systems train a TTS model on a fixed training data set and can therefore only output synthesized speech in a single style.
Therefore, how to control a computer device to output voice data in a designated voice style during an intelligent voice dialogue, and thereby further improve the efficiency of the business process, is one of the technical problems to be solved at present.
Disclosure of Invention
In view of the foregoing, there is a need for a speech generation method, system, computer device and computer-readable storage medium that solve the technical problem that current speech synthesis systems can only synthesize speech in a single style.
To achieve the above object, an embodiment of the present invention provides a method for generating speech, including:
acquiring user audio data and converting the user audio data into a user voice spectrogram;
extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
acquiring audio data to be edited, and converting the audio data to be edited into a voice spectrogram to be edited;
generating a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited; and
generating a voice signal for output according to the target voice spectrogram.
Illustratively, converting the user audio data into a user speech spectrogram comprises:
extracting user spectrum information of the user audio data;
generating a first waveform diagram corresponding to a time domain according to the user frequency spectrum information;
carrying out frame division processing on the first waveform diagram to obtain a plurality of first single-frame waveform diagrams;
performing Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame frequency spectrograms, wherein the horizontal axis of each first single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each first single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each first single-frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of first one-dimensional gray scale amplitude graphs to obtain the user voice spectrogram.
Illustratively, extracting the user voice attribute corresponding to the user audio data from the user voice spectrogram includes:
Extracting voice attributes of the user voice spectrogram through a target generator to obtain the user voice attributes corresponding to the user audio data;
the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
Illustratively, the method further comprises the training step of the GAN model:
acquiring a sample spectrogram and a sample attribute label corresponding to the sample spectrogram, wherein the sample spectrogram comprises a voice spectrogram;
inputting the sample spectrogram and the sample attribute tag into a GAN model;
determining a sample attribute region to which the sample spectrogram belongs through the spatial attention network;
inputting a sample spectrogram and the sample attribute label in the sample attribute area into the attribute editing network to obtain a generated spectrogram corresponding to the sample spectrogram;
inputting the sample spectrogram and the generated spectrogram into a discriminator of the GAN model, and judging whether the generated spectrogram accords with the graph distribution of the voice spectrogram or not through a true and false classifier in the discriminator;
If the generated spectrogram accords with the graph distribution of the user voice spectrogram, predicting sample voice attributes of the voice spectrogram through an attribute classifier in the discriminator; and
comparing the attribute difference between the sample voice attribute and the sample attribute label, and adjusting the parameters of the GAN model according to the attribute difference to obtain a target GAN model.
Illustratively, obtaining the audio data to be edited and converting the audio data to be edited into the voice spectrogram to be edited includes:
extracting spectral information to be edited of the audio data to be edited;
generating a second waveform diagram corresponding to a time domain according to the frequency spectrum information to be edited;
carrying out frame division processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams;
performing Fourier transform operation on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, wherein the horizontal axis of each second single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each second single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each second single-frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the second single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of second one-dimensional gray scale amplitude graphs to obtain a voice spectrogram to be edited.
Illustratively, generating a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited, including:
acquiring a target voice attribute corresponding to the user voice attribute according to the user voice attribute and the mapping relation diagram; and
inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram.
Illustratively, inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram, including:
determining a target attribute area to which the voice spectrogram to be edited belongs through the spatial attention network;
inputting the voice spectrogram to be edited in the target attribute area and the target voice attribute into the attribute editing network to obtain the target voice spectrogram, wherein the target voice spectrogram is the voice spectrogram to be edited carrying the target voice attribute.
To achieve the above object, an embodiment of the present invention further provides a speech generating system, including:
The first acquisition module is used for acquiring user audio data and converting the user audio data into a user voice spectrogram;
the attribute extraction module is used for extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
the second acquisition module is used for acquiring the audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited;
the voice editing acquisition module is used for generating a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited; and
the voice generation module is used for generating a voice signal for output according to the target voice spectrogram.
To achieve the above object, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the speech generation method described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium having stored therein a computer program executable by at least one processor to cause the at least one processor to perform the steps of the speech generating method as described above.
The voice generation method, system, computer device and computer-readable storage medium provided by the embodiments of the invention offer an effective way of controlling the style attribute of synthesized speech. The invention can analyze the voice style attribute of the user's speech and, according to that style attribute, edit the speech to be edited that corresponds to the user's speech so that the edited speech carries the desired voice style attribute, thereby outputting speech in a designated voice style by specifying the corresponding style attribute.
Drawings
Fig. 1 is a flowchart of a voice generating method according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of step S102 in fig. 1.
Fig. 3 is a user voice spectrogram of a voice generating method according to an embodiment of the invention.
Fig. 4 is a first waveform diagram of a voice generating method according to an embodiment of the invention.
Fig. 5 is a fourier transform operation diagram of a voice generating method according to an embodiment of the invention.
Fig. 6 is an inversion operation diagram of the voice generating method according to the embodiment of the invention.
Fig. 7 is a gray scale operation chart of the voice generating method according to the embodiment of the invention.
Fig. 8 is a schematic diagram illustrating a specific flow of step S104 in fig. 1.
Fig. 9 is a schematic diagram illustrating a specific flow of step S106 in fig. 1.
Fig. 10 is a schematic diagram illustrating a specific flow of step S106b in fig. 9.
Fig. 11 is a schematic diagram of a program module of a second embodiment of the speech generating system according to the present invention.
Fig. 12 is a schematic diagram of a hardware structure of a third embodiment of the computer device of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the descriptions "first", "second", etc. in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but such combinations must be realizable by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination does not exist and is not within the scope of protection claimed in the present invention.
In the following embodiments, an exemplary description will be made with the computer device 2 as an execution subject.
Example 1
Referring to FIG. 1, a flowchart of the steps of a speech generation method according to an embodiment of the present invention is shown. It will be appreciated that the flow charts in the method embodiments are not intended to limit the order in which the steps are performed. An exemplary description will be made below with the computer device 2 as an execution subject. Specifically, the following is described.
Step S100, user audio data are acquired, and the user audio data are converted into a user voice spectrogram.
The user audio data refers to audio information collected or stored by the user terminal. The audio information may be frequency and amplitude variation information of a segment of speech, sound effects and/or music, or a signal corresponding to a segment of sound recorded by the user at the user terminal. For example, the user audio data may be obtained from a voice call, which may be a mobile phone call, a WeChat call, a video call, or the like; the user audio data is the audio data produced by the user in the voice call. The user audio data is acquired while the user is on the call and converted into a voice spectrogram.
For example, as shown in fig. 2, the step S100 may further include:
step S100a, extracting user spectrum information of the user audio data.
Step S100b, generating a first waveform diagram corresponding to a time domain according to the user spectrum information.
Step S100c, performing frame division processing on the first waveform diagram to obtain a plurality of first single-frame waveform diagrams.
In step S100d, a Fourier transform is performed on each of the first single-frame waveform diagrams to obtain a plurality of first single-frame frequency spectrograms, where the horizontal axis of each first single-frame frequency spectrogram is used to represent frequency, and the vertical axis is used to represent amplitude.
Step S100e, performing an inversion operation and a gray scale operation on each first single frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, where the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single frame frequency spectrogram after the inversion operation by using a gray scale value.
Step S100f, synthesizing the plurality of first one-dimensional gray scale amplitude diagrams to obtain the user voice spectrogram.
As shown in fig. 3-7, the user voice spectrogram (spectrum) is an image reflecting the relationship between signal frequency and energy, and the first waveform diagram (wave) is a continuous sound waveform signal generated from the user spectrum information. In the embodiment of the invention, the user voice spectrogram can be obtained by processing the user spectrum information. For example, the user spectrum information is first converted into a first waveform diagram corresponding to its time domain; the first waveform diagram is divided into a plurality of first single-frame waveform diagrams of equal duration; each first single-frame waveform diagram is continuously sampled to obtain a plurality of sampling points; an FFT (fast Fourier transform) operation is performed on the sampling points to obtain a plurality of first single-frame frequency spectrograms (spectra); and an inversion operation and a gray scale operation are performed on each first single-frame frequency spectrogram to obtain a first one-dimensional gray scale amplitude diagram, where the horizontal axis of each first single-frame frequency spectrogram represents frequency and the vertical axis represents amplitude. Finally, the plurality of first one-dimensional gray scale amplitude diagrams are spliced to obtain the user voice spectrogram corresponding to the user spectrum information. For example, when the plurality of sampling points is 4096 sampling points, the duration of each first single-frame waveform diagram is 1/10 second, and the value corresponding to each point in the user voice spectrogram corresponding to the first waveform diagram is the amplitude of the corresponding frequency. The user voice spectrogram corresponding to the user spectrum information therefore reflects the frequency distribution of the audio over time.
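As a minimal sketch of the framing, FFT, inversion and gray scale steps described above, the following Python/NumPy code builds a spectrogram column by column; the frame length and gray scale normalization are assumptions for illustration, not parameters fixed by the invention.

```python
import numpy as np

def audio_to_spectrogram(samples, frame_len=4096):
    """Sketch of the spectrogram construction described above (assumed parameters).

    samples: 1-D array holding the time-domain waveform (the "first waveform diagram").
    Returns a 2-D array with one gray-scale amplitude column per frame (frequency x time).
    """
    # Frame division: split the waveform into equal-length single-frame segments.
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    columns = []
    for frame in frames:
        # Fourier transform of one single-frame waveform -> single-frame spectrum
        # (horizontal axis frequency, vertical axis amplitude).
        amplitude = np.abs(np.fft.rfft(frame))
        # The "inversion" (axis swap) is implicit here: each spectrum becomes one column.
        # Gray scale operation: map amplitudes to gray values in [0, 255].
        gray = 255.0 * amplitude / (amplitude.max() + 1e-9)
        columns.append(gray)

    # Synthesize (splice) the one-dimensional gray scale amplitude maps into the spectrogram.
    return np.stack(columns, axis=1)
```

The same pipeline is applied to the audio to be edited in step S104 below.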
Step S102, extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes.
User voice attributes are extracted from the voice spectrogram, for example style attributes such as happiness or anger, but also other attributes such as speech rate or gender.
Illustratively, the step S102 may further include: extracting voice attributes of the user voice spectrogram through a target generator to obtain the user voice attributes corresponding to the user audio data;
the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
Step S104, obtaining the audio data to be edited, and converting the audio data to be edited into a voice spectrogram to be edited.
The audio data to be edited can be obtained from a voice call, which may be a mobile phone call, a WeChat call, a video call, or the like; the audio data to be edited is the audio data produced by the user's call partner in the voice call. For example, the audio data to be edited of the call partner is acquired while the user is on the call, and the audio data to be edited is converted into a voice spectrogram.
For example, as shown in fig. 8, the step S104 may further include:
step S104a, extracting the frequency spectrum information to be edited of the audio data to be edited.
Step S104b, a second waveform diagram corresponding to the time domain is generated according to the frequency spectrum information to be edited.
Step S104c, carrying out framing processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams.
In step S104d, a Fourier transform is performed on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, where the horizontal axis of each second single-frame frequency spectrogram is used to represent frequency, and the vertical axis is used to represent amplitude.
Step S104e, performing inversion operation and gray scale operation on each second single frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging the horizontal axis and the vertical axis in the second single frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single frame frequency spectrogram after the inversion operation through gray scale values.
Step S104f, synthesizing the plurality of second one-dimensional gray scale amplitude diagrams to obtain a voice spectrogram to be edited.
The voice spectrogram to be edited is an image reflecting the relationship between signal frequency and energy, and the second waveform diagram is a continuous sound waveform signal generated from the spectrum information to be edited. In the embodiment of the invention, the voice spectrogram to be edited can be obtained by processing the spectrum information to be edited. For example, the spectrum information to be edited is first converted into a second waveform diagram corresponding to its time domain; the second waveform diagram is divided into a plurality of second single-frame waveform diagrams of equal duration; each second single-frame waveform diagram is continuously sampled to obtain a plurality of sampling points; an FFT (fast Fourier transform) operation is performed on the sampling points to obtain a plurality of second single-frame frequency spectrograms; and an inversion operation and a gray scale operation are performed on each second single-frame frequency spectrogram to obtain a second one-dimensional gray scale amplitude diagram, where the horizontal axis of each second single-frame frequency spectrogram represents frequency and the vertical axis represents amplitude. Finally, the plurality of second one-dimensional gray scale amplitude diagrams are spliced to obtain the voice spectrogram to be edited corresponding to the spectrum information to be edited. For example, when the plurality of sampling points is 4096 sampling points, the duration of each second single-frame waveform diagram is 1/10 second, and the value corresponding to each point in the voice spectrogram to be edited corresponding to the second waveform diagram is the amplitude of the corresponding frequency. The voice spectrogram to be edited corresponding to the spectrum information to be edited therefore reflects the frequency distribution of the audio over time.
Step S106, generating a target voice spectrogram according to the voice attribute of the user and the voice spectrogram to be edited.
To interact better with the user, when the user voice attribute is, for example, anger, a gentle target voice attribute is combined with the voice spectrogram to be edited to generate a target voice spectrogram with the gentle attribute. In other words, the context is taken into account, so that the interactive dialogue with the user is more genuine and natural and carries an emotionally anthropomorphic, engaging quality.
As shown in fig. 9, the step S106 may further include:
step S106a, according to the user voice attribute and the mapping relation diagram, obtaining a target voice attribute corresponding to the user voice attribute.
Step S106b, inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram.
Illustratively, the step of configuring the mapping relation diagram includes: inputting a plurality of real voice spectrograms into the target generator to obtain one or more voice attributes corresponding to each voice spectrogram; generating a mapping relation diagram between the one or more voice attributes and a designated target voice attribute; and storing the mapping relation in a database.
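One simple way to realize such a mapping relation diagram is a lookup table from the extracted user style attribute to a designated target style attribute; the entries and names below are purely illustrative assumptions, not the mapping defined by the invention.

```python
# Hypothetical mapping from a detected user style attribute to the target
# style attribute used for editing; the entries are illustrative only.
ATTRIBUTE_MAP = {
    "anger": "gentle",
    "sadness": "cheerful",
    "neutral": "neutral",
}

def target_attribute_for(user_attribute, default="neutral"):
    """Look up the target voice attribute mapped to the extracted user attribute."""
    return ATTRIBUTE_MAP.get(user_attribute, default)
```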
Illustratively, the speech generation model determines, according to the speech style of the call partner's speech in the call, a target speech style matched to that speech, and generates a final speech spectrogram in the matched target speech style.
For example, as shown in fig. 10, the step S106b may further include:
step S106b1, determining a target attribute area to which the voice spectrogram to be edited belongs through the spatial attention network;
step S106b2, inputting the voice spectrogram to be edited and the target voice attribute in the target attribute area into the attribute editing network to obtain the target voice spectrogram, where the target voice spectrogram is the voice spectrogram to be edited carrying the target voice attribute.
In some embodiments, the target generator G converts the input voice spectrogram I to be edited into an edited target voice spectrogram Î, conditioned on the target voice attribute c, for example Î = G(I, c). The target generator G comprises two parts. The attribute editing network is a neural network F_m with image style-attribute transfer capability; for example, given a spectrogram A providing the content and a spectrogram B carrying the attribute style, the attribute editing network can generate a spectrogram C that combines the content of A with the attribute style of B. The spatial attention network is a convolutional neural network F_a with attention capability. The attribute editing network focuses on how to edit, while the spatial attention network focuses on where to edit. For example, the attribute editing network takes the voice spectrogram I to be edited and the target voice attribute c as inputs and outputs an edited target voice spectrogram I_a, for example I_a = F_m(I, c); the spatial attention network takes the voice spectrogram I to be edited as input and predicts a spatial attention mask that limits the operation of the attribute editing network to the target attribute region: b = F_a(I). Ideally, the attention values of the style-attribute-related regions in b should be 1 and those of the other regions 0; in practice, the attention values are continuous values between 0 and 1. Accordingly, regions with non-zero attention values are regarded as style-attribute-related regions, and the remaining regions with an attention value of 0 are regarded as style-attribute-independent regions.
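As an illustration of this composition, the following PyTorch sketch wires an attribute editing branch and a spatial attention branch together as described above; the layer sizes, module names and attribute encoding are assumptions, not the networks defined by the invention.

```python
import torch
import torch.nn as nn

class TargetGenerator(nn.Module):
    """Sketch of generator G = (attribute editing network F_m, spatial attention network F_a)."""

    def __init__(self, channels=1, attr_dim=1):
        super().__init__()
        # Attribute editing network F_m: edits the spectrogram given the target attribute c.
        self.edit_net = nn.Sequential(
            nn.Conv2d(channels + attr_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )
        # Spatial attention network F_a: predicts where to edit (mask b in [0, 1]).
        self.attn_net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, spec, attr):
        # Broadcast the attribute c over the spectrogram and edit: I_a = F_m(I, c).
        attr_map = attr.view(attr.size(0), -1, 1, 1).expand(-1, -1, spec.size(2), spec.size(3))
        edited = self.edit_net(torch.cat([spec, attr_map], dim=1))
        # Predict the attention mask: b = F_a(I).
        mask = self.attn_net(spec)
        # Compose: edited regions come from I_a, the rest of the spectrogram stays unchanged.
        return mask * edited + (1.0 - mask) * spec
```

The key design point is the final line: the mask b gates where the edit from F_m is applied, so attribute-independent regions of the spectrogram pass through untouched.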
Step S108, generating a voice signal for output according to the target voice spectrogram.
A speech signal with the designated speech style is reconstructed from the target voice spectrogram by a signal reconstruction algorithm and output.
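The patent does not name a specific reconstruction algorithm; as one hedged example, a magnitude spectrogram can be turned back into a waveform with the Griffin-Lim algorithm via librosa. The sampling rate, hop length and iteration count below are assumptions.

```python
import librosa
import soundfile as sf

def spectrogram_to_speech(magnitude_spec, sr=16000, hop_length=256, out_path="output.wav"):
    """Reconstruct a time-domain speech signal from a magnitude spectrogram (illustrative only)."""
    # Griffin-Lim iteratively estimates the missing phase from the magnitudes.
    waveform = librosa.griffinlim(magnitude_spec, n_iter=32, hop_length=hop_length)
    sf.write(out_path, waveform, sr)
    return waveform
```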
Illustratively, the method further comprises training steps (1) - (7) of the GAN model.
The GAN model also includes a discriminator comprising a true-false classifier D_src and an attribute classifier D_cls, both of which are convolutional neural networks (CNNs) with Softmax outputs. The true-false classifier D_src and the attribute classifier D_cls may share a portion of the initial convolutional layers, followed by different fully connected layers for their different classification tasks. For example, the output D_src(I) represents the probability that the voice spectrogram I to be edited is real, and the output D_cls(c|I) of the attribute classifier represents the probability that the voice spectrogram I to be edited has the sound style attribute c, where c is binary, c in {0, 1}: when c is 1 the voice spectrogram to be edited contains the sound style attribute c, and when c is 0 it does not. The input voice spectrogram to be edited may be a voice spectrogram converted from real speech or from machine-generated speech.
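A matching sketch of the shared-trunk discriminator is given below; the convolutional trunk and layer sizes are assumptions, and sigmoid heads are used here in place of Softmax because the style attribute c is binary.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the discriminator: shared convolutional trunk with two classifier heads."""

    def __init__(self, channels=1, n_attrs=1):
        super().__init__()
        # Shared initial convolutional layers.
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # True-false classifier D_src: probability that the input spectrogram is real.
        self.src_head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
        # Attribute classifier D_cls: probability that the input carries each style attribute.
        self.cls_head = nn.Sequential(nn.Linear(64, n_attrs), nn.Sigmoid())

    def forward(self, spec):
        features = self.trunk(spec)
        return self.src_head(features), self.cls_head(features)
```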
(1) Acquiring a sample spectrogram and a sample attribute label corresponding to the sample spectrogram, wherein the sample spectrogram comprises a voice spectrogram; (2) Inputting the sample spectrogram and the sample attribute tag into a GAN model; (3) Determining a sample attribute region to which the sample spectrogram belongs through the spatial attention network; (4) Inputting a sample spectrogram and the sample attribute label in the sample attribute area into the attribute editing network to obtain a generated spectrogram corresponding to the sample spectrogram; (5) Inputting the sample spectrogram and the generated spectrogram into a discriminator of the GAN model, and judging whether the generated spectrogram accords with the graph distribution of the voice spectrogram or not through a true and false classifier in the discriminator; (6) If the generated spectrogram accords with the graph distribution of the user voice spectrogram, predicting sample voice attributes of the voice spectrogram through an attribute classifier in the discriminator; and (7) comparing the attribute difference between the sample voice attribute and the sample attribute label, and adjusting the parameters of the GAN model according to the attribute difference to obtain a target GAN model.
In some embodiments, under the guidance of the spatial attention network (attention mask), the GAN model composes the final edited target speech spectrogram so that the style-attribute-related regions are edited towards the target style attribute while the other regions remain unchanged:

Î = G(I, c) = b · I_a + (1 - b) · I,

where I_a = F_m(I, c), b = F_a(I), and · denotes element-wise multiplication.
in order to make the edited target voice spectrogramMore closely to the true speech spectrum, the true-false classifier can be tuned here by an opposing loss (absolute loss) function:
in order to makeCorrectly with the target style attribute c, a style attribute classification loss function is used to drive the attribute classifier about +.>Is close to the target value c:
to keep the sound style attribute independent, a reconstruction loss (reconstruction loss) function is used:
wherein c g Is the original style attribute of input voice spectrogram I to be edited, lambda 1 And lambda (lambda) 2 Is two equilibrium parameters. Wherein lambda is 1 (dual reconstruction loss) the purpose of the method is to make the edited target speech spectrogramSimilar to the voice spectrogram I to be edited; lambda (lambda) 2 (identity reconstruction loss) the purpose of the method is to make the input speech spectrogram I to be edited have its own sound style attribute c g Editing is performed without modification.
Finally, the generator G is optimized by minimizing the combined loss:

L_G = L_adv^G + L_cls^G + L_rec,

where L_adv^G = E_{I,c}[ log(1 - D_src(G(I, c))) ] is the generator's share of the adversarial loss. For the entire generative adversarial network (GAN) model with the spatial attention network, the generator G and the discriminator D can thus be trained in an adversarial manner.
As described above, the discriminator of the GAN model comprises a true-false classifier D_src and an attribute classifier D_cls. The loss function for optimizing the true/false (real/fake) classifier is a standard cross-entropy loss function:

L_src^D = -E_I[ log D_src(I) ] - E_{I,c}[ log(1 - D_src(Î)) ],

where I is the voice spectrogram to be edited and Î is the target speech spectrogram.
The loss function for optimizing the attribute classifier is also a standard cross-entropy loss:

L_cls^D = E_{I,c_g}[ -log D_cls(c_g | I) ],

where c_g is the manually labeled style attribute of the voice spectrogram I to be edited.
The overall loss function of the discriminator D can therefore be expressed as:

L_D = L_src^D + L_cls^D.

By minimizing this loss function, the resulting discriminator D can reliably distinguish the voice spectrogram to be edited from the target voice spectrogram and correctly predict the probability that Î contains the attribute c.
Example two
Fig. 11 is a schematic diagram of the program modules of a second embodiment of the speech generating system according to the present invention. The speech generating system 20 may include, or be partitioned into, one or more program modules that are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the speech generation method described above. A program module as referred to in the embodiments of the present invention is a series of computer program instruction segments capable of performing specified functions, and is better suited than the program itself to describing the execution of the speech generating system 20 in the storage medium. The functions of each program module of this embodiment are described below:
The first obtaining module 200 is configured to obtain user audio data, and convert the user audio data into a user voice spectrogram.
Illustratively, the first obtaining module 200 is further configured to: extracting user spectrum information of the user audio data;
generating a first waveform diagram corresponding to a time domain according to the user frequency spectrum information; carrying out frame division processing on the first oscillogram to obtain a plurality of first single-frame oscillograms; performing Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame frequency spectrograms, wherein the horizontal axis of each first single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each first single-frame frequency spectrogram is used for representing amplitude; performing inversion operation and gray scale operation on each first single-frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single-frame frequency spectrogram after the inversion operation through gray scale values; and synthesizing the plurality of first one-dimensional gray scale amplitude diagrams to obtain the user voice spectrogram.
The attribute extraction module 202 is configured to extract, from the user voice spectrogram, a user voice attribute corresponding to the user audio data, where the user voice attribute includes a style attribute.
Illustratively, the attribute extraction module 202 is further configured to: analyzing the user voice spectrogram through a GAN model to obtain user voice attributes of the user audio data; the GAN model includes a generator including a spatial attention network and a property editing network, and a discriminator including a true-false classifier and a property classifier.
Illustratively, the attribute extraction module 202 is further configured to: determining a target attribute area to which the user voice spectrogram belongs through the spatial attention network; inputting the user voice spectrogram in the target attribute area into the attribute editing network to obtain a generated voice spectrogram with user voice attributes; inputting the generated voice spectrogram and the user voice spectrogram into the discriminator, and judging whether the generated voice spectrogram accords with the graphic distribution of the user voice spectrogram or not through a true-false classifier in the discriminator; and if the generated voice spectrogram accords with the graph distribution of the user voice spectrogram, predicting the voice attribute of the user voice spectrogram through an attribute classifier in the discriminator to obtain the user voice attribute.
The second obtaining module 204 is configured to obtain audio data to be edited, and convert the audio data to be edited into a voice spectrogram to be edited;
illustratively, the second obtaining module 204 is further configured to: extracting spectral information to be edited of the audio data to be edited; generating a second waveform diagram corresponding to a time domain according to the frequency spectrum information to be edited; carrying out frame division processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams; performing Fourier transform operation on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, wherein the horizontal axis of each second single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each second single-frame frequency spectrogram is used for representing amplitude; performing inversion operation and gray scale operation on each second single-frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the second single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single-frame frequency spectrogram after the inversion operation through gray scale values; and synthesizing the plurality of second one-dimensional gray level amplitude graphs to obtain a voice spectrogram to be edited.
The voice editing module 206 is configured to generate a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited.
Illustratively, the voice editing module 206 is further configured to: acquiring voice attributes mapped with the user voice attributes according to the user voice attributes and the mapping relation diagram, and determining target voice attributes corresponding to the user voice attributes; editing the voice spectrogram to be edited according to the target voice attribute to obtain a target voice spectrogram.
A voice generating module 208, configured to generate a voice signal for output according to the target voice spectrogram.
Example III
Referring to fig. 12, a hardware architecture diagram of a computer device according to a third embodiment of the invention is shown. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack-mounted server, a blade server, a tower server, or a cabinet server (including a stand-alone server or a server cluster composed of multiple servers), or the like. As shown, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a speech generating system 20 communicatively coupled to each other via a system bus.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 2. Of course, the memory 21 may also include both internal storage units of the computer device 2 and external storage devices. In this embodiment, the memory 21 is typically used to store an operating system and various types of application software installed on the computer device 2, such as program codes of the speech generating system 20 of the second embodiment. Further, the memory 21 may be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data, for example, execute the speech generating system 20, to implement the speech generating method of the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, which network interface 23 is typically used for establishing a communication connection between the computer apparatus 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, or other wireless or wired network.
It is noted that fig. 12 only shows a computer device 2 having components 20-23, but it is understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.
In the present embodiment, the speech generating system 20 stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in the present embodiment) to complete the present invention.
For example, fig. 11 shows a schematic diagram of the program modules implementing the speech generating system 20 according to the second embodiment of the present invention, where the speech generating system 20 may be divided into a first obtaining module 200, an attribute extraction module 202, a second obtaining module 204, a voice editing module 206 and a voice generation module 208. A program module as referred to in the present invention is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing the execution of the speech generating system 20 in the computer device 2. The specific functions of the program modules 200-208 have been described in detail in the second embodiment and are not repeated here.
Example IV
The present embodiment also provides a computer-readable storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor, performs the corresponding functions. The computer-readable storage medium of the present embodiment is used in the speech generating system 20, and when executed by a processor, implements the speech generating method of the first embodiment.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and of course may also be implemented by hardware, but in many cases the former is the preferred implementation.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (9)

1. A method of speech generation, the method comprising:
acquiring user audio data and converting the user audio data into a user voice spectrogram;
extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
acquiring audio data to be edited, and converting the audio data to be edited into a voice spectrogram to be edited;
inputting the voice attribute of the user and the voice spectrogram to be edited into a target generator to generate a target voice spectrogram; and
generating a voice signal for output according to the target voice spectrogram;
the extracting the user voice attribute corresponding to the user audio data from the user voice spectrogram includes:
extracting voice attributes of the user voice spectrogram through the target generator to obtain the user voice attributes corresponding to the user audio data;
the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
2. The speech generating method of claim 1, wherein converting the user audio data into a user speech spectrogram comprises:
extracting user spectrum information of the user audio data;
generating a first waveform diagram corresponding to a time domain according to the user frequency spectrum information;
carrying out frame division processing on the first oscillogram to obtain a plurality of first single-frame oscillograms;
performing Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame frequency spectrograms, wherein the horizontal axis of each first single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each first single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each first single-frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of first one-dimensional gray scale amplitude graphs to obtain the user voice spectrogram.
3. The speech generation method of claim 1, wherein the method further comprises the training step of the GAN model:
Acquiring a sample spectrogram and a sample attribute label corresponding to the sample spectrogram, wherein the sample spectrogram comprises a voice spectrogram;
inputting the sample spectrogram and the sample attribute tag into a GAN model;
determining a sample attribute region to which the sample spectrogram belongs through the spatial attention network;
inputting a sample spectrogram and the sample attribute label in the sample attribute area into the attribute editing network to obtain a generated spectrogram corresponding to the sample spectrogram;
inputting the sample spectrogram and the generated spectrogram into a discriminator of the GAN model, and judging whether the generated spectrogram accords with the graph distribution of the voice spectrogram or not through a true and false classifier in the discriminator;
if the generated spectrogram accords with the graph distribution of the user voice spectrogram, predicting sample voice attributes of the voice spectrogram through an attribute classifier in the discriminator; and
comparing the attribute difference between the sample voice attribute and the sample attribute label, and adjusting the parameters of the GAN model according to the attribute difference to obtain a target GAN model.
4. The method of claim 1, wherein the obtaining audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited comprises:
Extracting spectral information to be edited of the audio data to be edited;
generating a second waveform diagram corresponding to a time domain according to the frequency spectrum information to be edited;
carrying out frame division processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams;
performing Fourier transform operation on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, wherein the horizontal axis of each second single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each second single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each second single-frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the second single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of second one-dimensional gray scale amplitude graphs to obtain a voice spectrogram to be edited.
5. The speech generating method of claim 1, wherein generating a target speech spectrogram from the user speech attribute and the speech spectrogram to be edited comprises:
acquiring a target voice attribute corresponding to the user voice attribute according to the user voice attribute and the mapping relation diagram; and
inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram.
6. The speech generating method of claim 5, wherein inputting the target speech attribute and the speech spectrogram to be edited into a target generator to obtain a target speech spectrogram comprises:
determining a target attribute area to which the voice spectrogram to be edited belongs through the spatial attention network;
inputting the voice spectrogram to be edited in the target attribute area and the target voice attribute into the attribute editing network to obtain the target voice spectrogram, wherein the target voice spectrogram is the voice spectrogram to be edited carrying the target voice attribute.
7. A speech generation system, comprising:
the first acquisition module is used for acquiring user audio data and converting the user audio data into a user voice spectrogram;
the attribute extraction module is used for extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
The second acquisition module is used for acquiring the audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited;
the voice editing acquisition module is used for inputting the user voice attribute and the voice spectrogram to be edited into a target generator to generate a target voice spectrogram; and
the voice generation module is used for generating a voice signal for output according to the target voice spectrogram;
the attribute extraction module is further configured to extract, by using the target generator, a voice attribute of the user voice spectrogram, so as to obtain the user voice attribute corresponding to the user audio data; the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
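For illustration only, the modules of claim 7 could be wired together roughly as follows. The generator and vocoder objects and the extract_attributes method are assumptions, and waveform_to_grayscale_spectrogram refers to the conversion helper sketched after claim 4.

    class SpeechGenerationSystem:
        """A sketch of the claim-7 module wiring; every dependency here is assumed."""

        def __init__(self, target_generator, vocoder):
            self.target_generator = target_generator    # attention + attribute editing generator
            self.vocoder = vocoder                      # turns a spectrogram back into a waveform

        def run(self, user_audio, audio_to_edit):
            user_spec = waveform_to_grayscale_spectrogram(user_audio)             # first acquisition module
            user_attribute = self.target_generator.extract_attributes(user_spec)  # attribute extraction module
            edit_spec = waveform_to_grayscale_spectrogram(audio_to_edit)          # second acquisition module
            target_spec = self.target_generator(edit_spec, user_attribute)        # voice editing module
            return self.vocoder(target_spec)                                      # voice generation module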
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the speech generating method according to any one of claims 1 to 6.
9. A computer-readable storage medium, in which a computer program is stored, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the speech generating method according to any one of claims 1 to 6.
CN202010052356.5A 2020-01-17 2020-01-17 Speech generation method, system and computer equipment Active CN111276119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010052356.5A CN111276119B (en) 2020-01-17 2020-01-17 Speech generation method, system and computer equipment

Publications (2)

Publication Number Publication Date
CN111276119A (en) 2020-06-12
CN111276119B (en) 2023-08-22

Family

ID=71001048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010052356.5A Active CN111276119B (en) 2020-01-17 2020-01-17 Speech generation method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN111276119B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768756B (en) * 2020-06-24 2023-10-20 华人运通(上海)云计算科技有限公司 Information processing method, information processing device, vehicle and computer storage medium
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112699726B (en) * 2020-11-11 2023-04-07 中国科学院计算技术研究所数字经济产业研究院 Image enhancement method, genuine-fake commodity identification method and equipment
CN112562728B (en) * 2020-11-13 2024-06-18 百果园技术(新加坡)有限公司 Method for generating countermeasure network training, method and device for audio style migration
CN114299969B (en) * 2021-08-19 2024-06-11 腾讯科技(深圳)有限公司 Audio synthesis method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785823A (en) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 Phoneme synthesizing method and system
CN109817246A (en) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110189766A (en) * 2019-06-14 2019-08-30 西南科技大学 A kind of voice style transfer method neural network based
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion

Also Published As

Publication number Publication date
CN111276119A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN111276119B (en) Speech generation method, system and computer equipment
US10553201B2 (en) Method and apparatus for speech synthesis
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
CN107481717B (en) Acoustic model training method and system
US10255903B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN111444382B (en) Audio processing method and device, computer equipment and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN110930975B (en) Method and device for outputting information
WO2024055752A9 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN116092503B (en) Fake voice detection method, device, equipment and medium combining time domain and frequency domain
CN114400005A (en) Voice message generation method and device, computer equipment and storage medium
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN115294947A (en) Audio data processing method and device, electronic equipment and medium
CN111862931B (en) Voice generation method and device
CN113012706B (en) Data processing method and device and electronic equipment
CN117877517B (en) Method, device, equipment and medium for generating environmental sound based on antagonistic neural network
CN114141259A (en) Voice conversion method, device, equipment, storage medium and program product
CN118447820A (en) Voice conversion method, device, equipment and medium based on style
CN112750423A (en) Method, device and system for constructing personalized speech synthesis model and electronic equipment
CN118193713A (en) Knowledge question-answering method and device based on virtual digital expert
CN115762472A (en) Voice rhythm identification method, system, equipment and storage medium
CN115547291A (en) Speech synthesis method, apparatus, electronic device and storage medium
CN115985287A (en) Speech synthesis method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant