CN111276119B - Speech generation method, system and computer equipment - Google Patents

Speech generation method, system and computer equipment Download PDF

Info

Publication number
CN111276119B
Authority
CN
China
Prior art keywords
spectrogram
voice
user
attribute
edited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010052356.5A
Other languages
Chinese (zh)
Other versions
CN111276119A (en)
Inventor
马坤
赵之砚
施奕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010052356.5A priority Critical patent/CN111276119B/en
Publication of CN111276119A publication Critical patent/CN111276119A/en
Application granted granted Critical
Publication of CN111276119B publication Critical patent/CN111276119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a voice generation method comprising the following steps: acquiring user audio data and converting the user audio data into a user voice spectrogram; extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes; acquiring audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited; generating a target voice spectrogram according to the user voice attributes and the voice spectrogram to be edited; and generating a voice signal for output according to the target voice spectrogram. Embodiments of the invention can thus output speech with a designated voice style attribute.

Description

Speech generation method, system and computer equipment
Technical Field
Embodiments of the present invention relate to the field of speech synthesis, and in particular to a speech generation method, system, computer device, and computer-readable storage medium.
Background
Speech synthesis is an important capability in the field of artificial intelligence: more natural and emotionally expressive synthetic speech can greatly improve the user's service experience and reflects the state of the art of artificial intelligence. In practical applications, however, the synthesized speech usually keeps a fixed style throughout the interaction with the user, which makes for a poor user experience, because most current speech synthesis systems train a TTS model on a fixed training data set and can therefore only output synthesized speech in a single style.
Therefore, how to control a computer device to output voice data in a designated voice style during an intelligent voice dialogue, and thereby further improve the efficiency of the business process, is one of the technical problems to be solved at present.
Disclosure of Invention
In view of the foregoing, there is a need for a speech generation method, system, computer device and computer-readable storage medium that solve the technical problem that current speech synthesis systems can only synthesize speech in a single style.
To achieve the above object, an embodiment of the present invention provides a method for generating speech, including:
acquiring user audio data and converting the user audio data into a user voice spectrogram;
extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
acquiring audio data to be edited, and converting the audio data to be edited into a voice spectrogram to be edited;
generating a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited; and
generating a voice signal for output according to the target voice spectrogram.
Illustratively, converting the user audio data into a user speech spectrogram comprises:
extracting user spectrum information of the user audio data;
generating a first waveform diagram corresponding to a time domain according to the user frequency spectrum information;
carrying out frame division processing on the first waveform diagram to obtain a plurality of first single-frame waveform diagrams;
performing Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame frequency spectrograms, wherein the horizontal axis of each first single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each first single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each first single-frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of first one-dimensional gray scale amplitude graphs to obtain the user voice spectrogram.
Illustratively, extracting the user voice attribute corresponding to the user audio data from the user voice spectrogram includes:
Extracting voice attributes of the user voice spectrogram through a target generator to obtain the user voice attributes corresponding to the user audio data;
the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
Illustratively, the method further comprises the training step of the GAN model:
acquiring a sample spectrogram and a sample attribute label corresponding to the sample spectrogram, wherein the sample spectrogram comprises a voice spectrogram;
inputting the sample spectrogram and the sample attribute tag into a GAN model;
determining a sample attribute region to which the sample spectrogram belongs through the spatial attention network;
inputting a sample spectrogram and the sample attribute label in the sample attribute area into the attribute editing network to obtain a generated spectrogram corresponding to the sample spectrogram;
inputting the sample spectrogram and the generated spectrogram into a discriminator of the GAN model, and judging whether the generated spectrogram accords with the graph distribution of the voice spectrogram or not through a true and false classifier in the discriminator;
If the generated spectrogram accords with the graph distribution of the user voice spectrogram, predicting sample voice attributes of the voice spectrogram through an attribute classifier in the discriminator; and
comparing the attribute difference between the sample voice attribute and the sample attribute label, and adjusting the parameters of the GAN model according to the attribute difference to obtain a target GAN model.
Illustratively, obtaining the audio data to be edited and converting the audio data to be edited into the voice spectrogram to be edited includes:
extracting spectral information to be edited of the audio data to be edited;
generating a second waveform diagram corresponding to a time domain according to the frequency spectrum information to be edited;
carrying out frame division processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams;
performing Fourier transform operation on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, wherein the horizontal axis of each second single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each second single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each second single-frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the second single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of second one-dimensional gray scale amplitude graphs to obtain a voice spectrogram to be edited.
Illustratively, generating a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited, including:
acquiring a target voice attribute corresponding to the user voice attribute according to the user voice attribute and the mapping relation diagram; and
inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram.
Illustratively, inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram, including:
determining a target attribute area to which the voice spectrogram to be edited belongs through the spatial attention network;
inputting the voice spectrogram to be edited in the target attribute area and the target voice attribute into the attribute editing network to obtain the target voice spectrogram, wherein the target voice spectrogram is the voice spectrogram to be edited carrying the target voice attribute.
To achieve the above object, an embodiment of the present invention further provides a speech generating system, including:
The first acquisition module is used for acquiring user audio data and converting the user audio data into a user voice spectrogram;
the attribute extraction module is used for extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
the second acquisition module is used for acquiring the audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited;
the voice editing acquisition module is used for generating a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited; and
the voice generation module is used for generating a voice signal for output according to the target voice spectrogram.
To achieve the above object, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the speech generation method described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium having stored therein a computer program executable by at least one processor to cause the at least one processor to perform the steps of the speech generating method as described above.
The voice generation method, system, computer device and computer-readable storage medium provided by the embodiments of the invention offer an effective way of controlling the style attribute of synthesized speech. The invention can analyze the voice style attribute of the user's speech and, according to that style attribute, edit the speech to be edited that corresponds to the user's speech so that the edited speech carries the desired voice style attribute, thereby outputting speech in a designated voice style by specifying the corresponding style attribute.
Drawings
Fig. 1 is a flowchart of a voice generating method according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of step S102 in fig. 1.
Fig. 3 is a user voice spectrogram of a voice generating method according to an embodiment of the invention.
Fig. 4 is a first waveform diagram of a voice generating method according to an embodiment of the invention.
Fig. 5 is a fourier transform operation diagram of a voice generating method according to an embodiment of the invention.
Fig. 6 is an inversion operation diagram of the voice generating method according to the embodiment of the invention.
Fig. 7 is a gray scale operation chart of the voice generating method according to the embodiment of the invention.
Fig. 8 is a schematic diagram illustrating a specific flow of step S104 in fig. 1.
Fig. 9 is a schematic diagram illustrating a specific flow of step S106 in fig. 1.
Fig. 10 is a schematic diagram illustrating a specific flow of step S106b in fig. 9.
Fig. 11 is a schematic diagram of a program module of a second embodiment of the speech generating system according to the present invention.
Fig. 12 is a schematic diagram of a hardware structure of a third embodiment of the computer device of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the descriptions "first", "second", etc. in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but such combinations must be realizable by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination does not exist and is not within the scope of protection claimed in the present invention.
In the following embodiments, an exemplary description will be made with the computer device 2 as an execution subject.
Example 1
Referring to FIG. 1, a flowchart of the steps of a speech generation method according to an embodiment of the present invention is shown. It will be appreciated that the flow charts in the method embodiments are not intended to limit the order in which the steps are performed. An exemplary description will be made below with the computer device 2 as an execution subject. Specifically, the following is described.
Step S100, user audio data are acquired, and the user audio data are converted into a user voice spectrogram.
The user audio data refers to audio information collected or stored by the user terminal. The audio information may be frequency and amplitude variation information of a segment of speech, sound effects and/or music, or a signal corresponding to a segment of sound recorded by the user at the user terminal. For example, the user audio data may be obtained from a voice call, which may be a mobile phone call, a WeChat call, a video call, or the like; the user audio data is the audio data produced by the user in the voice call. The user audio data is acquired while the user is on the call and converted into a voice spectrogram.
For example, as shown in fig. 2, the step S100 may further include:
step S100a, extracting user spectrum information of the user audio data.
Step S100b, generating a first waveform diagram corresponding to a time domain according to the user spectrum information.
Step S100c, performing frame division processing on the first waveform diagram to obtain a plurality of first single-frame waveform diagrams.
In step S100d, a Fourier transform is performed on each of the first single-frame waveform diagrams to obtain a plurality of first single-frame frequency spectrograms, where the horizontal axis of each first single-frame frequency spectrogram is used to represent frequency, and the vertical axis is used to represent amplitude.
Step S100e, performing an inversion operation and a gray scale operation on each first single frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, where the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single frame frequency spectrogram after the inversion operation by using a gray scale value.
Step S100f, synthesizing the plurality of first one-dimensional gray scale amplitude diagrams to obtain the user voice spectrogram.
As shown in fig. 3-7, the user voice spectrogram (spectrum) is an image reflecting the relationship between signal frequency and energy, and the first waveform diagram (wave) is a continuous sound waveform signal generated from the user spectrum information. In the embodiment of the invention, the user voice spectrogram can be obtained by processing the user spectrum information. For example, the user spectrum information is first converted into a first waveform diagram corresponding to its time domain; the first waveform diagram is divided into a plurality of first single-frame waveform diagrams of equal duration; each first single-frame waveform diagram is continuously sampled to obtain a plurality of sampling points; an FFT (fast Fourier transform) operation is performed on the sampling points to obtain a plurality of first single-frame frequency spectrograms (spectra); and an inversion operation and a gray scale operation are performed on each first single-frame frequency spectrogram to obtain a first one-dimensional gray scale amplitude diagram, where the horizontal axis of each first single-frame frequency spectrogram represents frequency and the vertical axis represents amplitude. Finally, the plurality of first one-dimensional gray scale amplitude diagrams are spliced to obtain the user voice spectrogram corresponding to the user spectrum information. For example, when the plurality of sampling points is 4096 sampling points, the duration of each first single-frame waveform diagram is 1/10 second, and the value corresponding to each point in the user voice spectrogram corresponding to the first waveform diagram is the amplitude of the corresponding frequency. The user voice spectrogram corresponding to the user spectrum information therefore reflects the frequency distribution of the audio over time.
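As a minimal sketch of the framing, FFT, inversion and gray scale steps described above, the following Python/NumPy code builds a spectrogram column by column; the frame length and gray scale normalization are assumptions for illustration, not parameters fixed by the invention.

```python
import numpy as np

def audio_to_spectrogram(samples, frame_len=4096):
    """Sketch of the spectrogram construction described above (assumed parameters).

    samples: 1-D array holding the time-domain waveform (the "first waveform diagram").
    Returns a 2-D array with one gray-scale amplitude column per frame (frequency x time).
    """
    # Frame division: split the waveform into equal-length single-frame segments.
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    columns = []
    for frame in frames:
        # Fourier transform of one single-frame waveform -> single-frame spectrum
        # (horizontal axis frequency, vertical axis amplitude).
        amplitude = np.abs(np.fft.rfft(frame))
        # The "inversion" (axis swap) is implicit here: each spectrum becomes one column.
        # Gray scale operation: map amplitudes to gray values in [0, 255].
        gray = 255.0 * amplitude / (amplitude.max() + 1e-9)
        columns.append(gray)

    # Synthesize (splice) the one-dimensional gray scale amplitude maps into the spectrogram.
    return np.stack(columns, axis=1)
```

The same pipeline is applied to the audio to be edited in step S104 below.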
Step S102, extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes.
User voice attributes are extracted from the voice spectrogram, for example style attributes such as happiness or anger, but also other attributes such as speech rate or gender.
Illustratively, the step S102 may further include: extracting voice attributes of the user voice spectrogram through a target generator to obtain the user voice attributes corresponding to the user audio data;
the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
Step S104, obtaining the audio data to be edited, and converting the audio data to be edited into a voice spectrogram to be edited.
The audio data to be edited can be obtained from a voice call, which may be a mobile phone call, a WeChat call, a video call, or the like; the audio data to be edited is the audio data produced by the user's call partner in the voice call. For example, the audio data to be edited of the call partner is acquired while the user is on the call, and the audio data to be edited is converted into a voice spectrogram.
For example, as shown in fig. 8, the step S104 may further include:
step S104a, extracting the frequency spectrum information to be edited of the audio data to be edited.
Step S104b, a second waveform diagram corresponding to the time domain is generated according to the frequency spectrum information to be edited.
Step S104c, carrying out framing processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams.
In step S104d, a Fourier transform is performed on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, where the horizontal axis of each second single-frame frequency spectrogram is used to represent frequency, and the vertical axis is used to represent amplitude.
Step S104e, performing inversion operation and gray scale operation on each second single frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging the horizontal axis and the vertical axis in the second single frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single frame frequency spectrogram after the inversion operation through gray scale values.
Step S104f, synthesizing the plurality of second one-dimensional gray scale amplitude diagrams to obtain a voice spectrogram to be edited.
The voice spectrogram to be edited is an image reflecting the relationship between signal frequency and energy, and the second waveform diagram is a continuous sound waveform signal generated from the spectrum information to be edited. In the embodiment of the invention, the voice spectrogram to be edited can be obtained by processing the spectrum information to be edited. For example, the spectrum information to be edited is first converted into a second waveform diagram corresponding to its time domain; the second waveform diagram is divided into a plurality of second single-frame waveform diagrams of equal duration; each second single-frame waveform diagram is continuously sampled to obtain a plurality of sampling points; an FFT (fast Fourier transform) operation is performed on the sampling points to obtain a plurality of second single-frame frequency spectrograms; and an inversion operation and a gray scale operation are performed on each second single-frame frequency spectrogram to obtain a second one-dimensional gray scale amplitude diagram, where the horizontal axis of each second single-frame frequency spectrogram represents frequency and the vertical axis represents amplitude. Finally, the plurality of second one-dimensional gray scale amplitude diagrams are spliced to obtain the voice spectrogram to be edited corresponding to the spectrum information to be edited. For example, when the plurality of sampling points is 4096 sampling points, the duration of each second single-frame waveform diagram is 1/10 second, and the value corresponding to each point in the voice spectrogram to be edited corresponding to the second waveform diagram is the amplitude of the corresponding frequency. The voice spectrogram to be edited corresponding to the spectrum information to be edited therefore reflects the frequency distribution of the audio over time.
Step S106, generating a target voice spectrogram according to the voice attribute of the user and the voice spectrogram to be edited.
To interact better with the user, when the user voice attribute is, for example, anger, a gentle target voice attribute is combined with the voice spectrogram to be edited to generate a target voice spectrogram with the gentle attribute. In other words, the context is taken into account, so that the interactive dialogue with the user is more genuine and natural and carries an emotionally anthropomorphic, engaging quality.
As shown in fig. 9, the step S106 may further include:
step S106a, according to the user voice attribute and the mapping relation diagram, obtaining a target voice attribute corresponding to the user voice attribute.
Step S106b, inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram.
Illustratively, the step of configuring the mapping relation diagram includes: inputting a plurality of real voice spectrograms into the target generator to obtain one or more voice attributes corresponding to each voice spectrogram; generating a mapping relation diagram between the one or more voice attributes and a designated target voice attribute; and storing the mapping relation in a database.
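One simple way to realize such a mapping relation diagram is a lookup table from the extracted user style attribute to a designated target style attribute; the entries and names below are purely illustrative assumptions, not the mapping defined by the invention.

```python
# Hypothetical mapping from a detected user style attribute to the target
# style attribute used for editing; the entries are illustrative only.
ATTRIBUTE_MAP = {
    "anger": "gentle",
    "sadness": "cheerful",
    "neutral": "neutral",
}

def target_attribute_for(user_attribute, default="neutral"):
    """Look up the target voice attribute mapped to the extracted user attribute."""
    return ATTRIBUTE_MAP.get(user_attribute, default)
```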
Illustratively, the speech generation model determines, according to the speech style of the call partner's speech in the call, a target speech style matched to that speech, and generates a final speech spectrogram in the matched target speech style.
For example, as shown in fig. 10, the step S106b may further include:
step S106b1, determining a target attribute area to which the voice spectrogram to be edited belongs through the spatial attention network;
step S106b2, inputting the voice spectrogram to be edited and the target voice attribute in the target attribute area into the attribute editing network to obtain the target voice spectrogram, where the target voice spectrogram is the voice spectrogram to be edited carrying the target voice attribute.
In some embodiments, the target generator G converts the input voice spectrogram I to be edited into an edited target voice spectrogram Î, conditioned on the target voice attribute c, for example Î = G(I, c). The target generator G comprises two parts. The attribute editing network is a neural network F_m with image style-attribute transfer capability; for example, given a spectrogram A providing the content and a spectrogram B carrying the attribute style, the attribute editing network can generate a spectrogram C that combines the content of A with the attribute style of B. The spatial attention network is a convolutional neural network F_a with attention capability. The attribute editing network focuses on how to edit, while the spatial attention network focuses on where to edit. For example, the attribute editing network takes the voice spectrogram I to be edited and the target voice attribute c as inputs and outputs an edited target voice spectrogram I_a, for example I_a = F_m(I, c); the spatial attention network takes the voice spectrogram I to be edited as input and predicts a spatial attention mask that limits the operation of the attribute editing network to the target attribute region: b = F_a(I). Ideally, the attention values of the style-attribute-related regions in b should be 1 and those of the other regions 0; in practice, the attention values are continuous values between 0 and 1. Accordingly, regions with non-zero attention values are regarded as style-attribute-related regions, and the remaining regions with an attention value of 0 are regarded as style-attribute-independent regions.
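As an illustration of this composition, the following PyTorch sketch wires an attribute editing branch and a spatial attention branch together as described above; the layer sizes, module names and attribute encoding are assumptions, not the networks defined by the invention.

```python
import torch
import torch.nn as nn

class TargetGenerator(nn.Module):
    """Sketch of generator G = (attribute editing network F_m, spatial attention network F_a)."""

    def __init__(self, channels=1, attr_dim=1):
        super().__init__()
        # Attribute editing network F_m: edits the spectrogram given the target attribute c.
        self.edit_net = nn.Sequential(
            nn.Conv2d(channels + attr_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )
        # Spatial attention network F_a: predicts where to edit (mask b in [0, 1]).
        self.attn_net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, spec, attr):
        # Broadcast the attribute c over the spectrogram and edit: I_a = F_m(I, c).
        attr_map = attr.view(attr.size(0), -1, 1, 1).expand(-1, -1, spec.size(2), spec.size(3))
        edited = self.edit_net(torch.cat([spec, attr_map], dim=1))
        # Predict the attention mask: b = F_a(I).
        mask = self.attn_net(spec)
        # Compose: edited regions come from I_a, the rest of the spectrogram stays unchanged.
        return mask * edited + (1.0 - mask) * spec
```

The key design point is the final line: the mask b gates where the edit from F_m is applied, so attribute-independent regions of the spectrogram pass through untouched.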
Step S108, generating a voice signal for output according to the target voice spectrogram.
A speech signal with the designated speech style is reconstructed from the target voice spectrogram by a signal reconstruction algorithm and output.
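The patent does not name a specific reconstruction algorithm; as one hedged example, a magnitude spectrogram can be turned back into a waveform with the Griffin-Lim algorithm via librosa. The sampling rate, hop length and iteration count below are assumptions.

```python
import librosa
import soundfile as sf

def spectrogram_to_speech(magnitude_spec, sr=16000, hop_length=256, out_path="output.wav"):
    """Reconstruct a time-domain speech signal from a magnitude spectrogram (illustrative only)."""
    # Griffin-Lim iteratively estimates the missing phase from the magnitudes.
    waveform = librosa.griffinlim(magnitude_spec, n_iter=32, hop_length=hop_length)
    sf.write(out_path, waveform, sr)
    return waveform
```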
Illustratively, the method further comprises training steps (1) - (7) of the GAN model.
The GAN model also includes a discriminator comprising a true-false classifier D_src and an attribute classifier D_cls, both of which are convolutional neural networks (CNNs) with Softmax outputs. The true-false classifier D_src and the attribute classifier D_cls may share a portion of the initial convolutional layers, followed by different fully connected layers for their different classification tasks. For example, the output D_src(I) represents the probability that the voice spectrogram I to be edited is real, and the output D_cls(c|I) of the attribute classifier represents the probability that the voice spectrogram I to be edited has the sound style attribute c, where c is binary, c in {0, 1}: when c is 1 the voice spectrogram to be edited contains the sound style attribute c, and when c is 0 it does not. The input voice spectrogram to be edited may be a voice spectrogram converted from real speech or from machine-generated speech.
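A matching sketch of the shared-trunk discriminator is given below; the convolutional trunk and layer sizes are assumptions, and sigmoid heads are used here in place of Softmax because the style attribute c is binary.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the discriminator: shared convolutional trunk with two classifier heads."""

    def __init__(self, channels=1, n_attrs=1):
        super().__init__()
        # Shared initial convolutional layers.
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # True-false classifier D_src: probability that the input spectrogram is real.
        self.src_head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
        # Attribute classifier D_cls: probability that the input carries each style attribute.
        self.cls_head = nn.Sequential(nn.Linear(64, n_attrs), nn.Sigmoid())

    def forward(self, spec):
        features = self.trunk(spec)
        return self.src_head(features), self.cls_head(features)
```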
(1) Acquiring a sample spectrogram and a sample attribute label corresponding to the sample spectrogram, wherein the sample spectrogram comprises a voice spectrogram; (2) Inputting the sample spectrogram and the sample attribute tag into a GAN model; (3) Determining a sample attribute region to which the sample spectrogram belongs through the spatial attention network; (4) Inputting a sample spectrogram and the sample attribute label in the sample attribute area into the attribute editing network to obtain a generated spectrogram corresponding to the sample spectrogram; (5) Inputting the sample spectrogram and the generated spectrogram into a discriminator of the GAN model, and judging whether the generated spectrogram accords with the graph distribution of the voice spectrogram or not through a true and false classifier in the discriminator; (6) If the generated spectrogram accords with the graph distribution of the user voice spectrogram, predicting sample voice attributes of the voice spectrogram through an attribute classifier in the discriminator; and (7) comparing the attribute difference between the sample voice attribute and the sample attribute label, and adjusting the parameters of the GAN model according to the attribute difference to obtain a target GAN model.
In some embodiments, under the guidance of the spatial attention network (attention mask), the GAN model composes the final edited target speech spectrogram so that the style-attribute-related regions are edited towards the target style attribute while the other regions remain unchanged:

Î = G(I, c) = b · I_a + (1 - b) · I,

where I_a = F_m(I, c), b = F_a(I), and · denotes element-wise multiplication.
in order to make the edited target voice spectrogramMore closely to the true speech spectrum, the true-false classifier can be tuned here by an opposing loss (absolute loss) function:
in order to makeCorrectly with the target style attribute c, a style attribute classification loss function is used to drive the attribute classifier about +.>Is close to the target value c:
to keep the sound style attribute independent, a reconstruction loss (reconstruction loss) function is used:
wherein c g Is the original style attribute of input voice spectrogram I to be edited, lambda 1 And lambda (lambda) 2 Is two equilibrium parameters. Wherein lambda is 1 (dual reconstruction loss) the purpose of the method is to make the edited target speech spectrogramSimilar to the voice spectrogram I to be edited; lambda (lambda) 2 (identity reconstruction loss) the purpose of the method is to make the input speech spectrogram I to be edited have its own sound style attribute c g Editing is performed without modification.
Finally, the generator G is optimized by minimizing the combined loss:

L_G = L_adv^G + L_cls^G + L_rec,

where L_adv^G = E_{I,c}[ log(1 - D_src(G(I, c))) ] is the generator's share of the adversarial loss. For the entire generative adversarial network (GAN) model with the spatial attention network, the generator G and the discriminator D can thus be trained in an adversarial manner.
As described above, the discriminator of the GAN model comprises a true-false classifier D_src and an attribute classifier D_cls. The loss function for optimizing the true/false (real/fake) classifier is a standard cross-entropy loss function:

L_src^D = -E_I[ log D_src(I) ] - E_{I,c}[ log(1 - D_src(Î)) ],

where I is the voice spectrogram to be edited and Î is the target speech spectrogram.
The loss function for optimizing the attribute classifier is also a standard cross-entropy loss:

L_cls^D = E_{I,c_g}[ -log D_cls(c_g | I) ],

where c_g is the manually labeled style attribute of the voice spectrogram I to be edited.
The overall loss function of the discriminator D can therefore be expressed as:

L_D = L_src^D + L_cls^D.

By minimizing this loss function, the resulting discriminator D can reliably distinguish the voice spectrogram to be edited from the target voice spectrogram and correctly predict the probability that Î contains the attribute c.
Example two
Fig. 11 is a schematic diagram of the program modules of a second embodiment of the speech generating system according to the present invention. The speech generating system 20 may include, or be partitioned into, one or more program modules that are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the speech generation method described above. A program module as referred to in the embodiments of the present invention is a series of computer program instruction segments capable of performing specified functions, and is better suited than the program itself to describing the execution of the speech generating system 20 in the storage medium. The functions of each program module of this embodiment are described below:
The first obtaining module 200 is configured to obtain user audio data, and convert the user audio data into a user voice spectrogram.
Illustratively, the first obtaining module 200 is further configured to: extracting user spectrum information of the user audio data;
generating a first waveform diagram corresponding to a time domain according to the user frequency spectrum information; carrying out frame division processing on the first oscillogram to obtain a plurality of first single-frame oscillograms; performing Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame frequency spectrograms, wherein the horizontal axis of each first single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each first single-frame frequency spectrogram is used for representing amplitude; performing inversion operation and gray scale operation on each first single-frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single-frame frequency spectrogram after the inversion operation through gray scale values; and synthesizing the plurality of first one-dimensional gray scale amplitude diagrams to obtain the user voice spectrogram.
The attribute extraction module 202 is configured to extract, from the user voice spectrogram, a user voice attribute corresponding to the user audio data, where the user voice attribute includes a style attribute.
Illustratively, the attribute extraction module 202 is further configured to: analyzing the user voice spectrogram through a GAN model to obtain user voice attributes of the user audio data; the GAN model includes a generator including a spatial attention network and a property editing network, and a discriminator including a true-false classifier and a property classifier.
Illustratively, the attribute extraction module 202 is further configured to: determining a target attribute area to which the user voice spectrogram belongs through the spatial attention network; inputting the user voice spectrogram in the target attribute area into the attribute editing network to obtain a generated voice spectrogram with user voice attributes; inputting the generated voice spectrogram and the user voice spectrogram into the discriminator, and judging whether the generated voice spectrogram accords with the graphic distribution of the user voice spectrogram or not through a true-false classifier in the discriminator; and if the generated voice spectrogram accords with the graph distribution of the user voice spectrogram, predicting the voice attribute of the user voice spectrogram through an attribute classifier in the discriminator to obtain the user voice attribute.
The second obtaining module 204 is configured to obtain audio data to be edited, and convert the audio data to be edited into a voice spectrogram to be edited;
illustratively, the second obtaining module 204 is further configured to: extracting spectral information to be edited of the audio data to be edited; generating a second waveform diagram corresponding to a time domain according to the frequency spectrum information to be edited; carrying out frame division processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams; performing Fourier transform operation on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, wherein the horizontal axis of each second single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each second single-frame frequency spectrogram is used for representing amplitude; performing inversion operation and gray scale operation on each second single-frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the second single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single-frame frequency spectrogram after the inversion operation through gray scale values; and synthesizing the plurality of second one-dimensional gray level amplitude graphs to obtain a voice spectrogram to be edited.
The voice editing module 206 is configured to generate a target voice spectrogram according to the user voice attribute and the voice spectrogram to be edited.
Illustratively, the voice editing module 206 is further configured to: acquiring voice attributes mapped with the user voice attributes according to the user voice attributes and the mapping relation diagram, and determining target voice attributes corresponding to the user voice attributes; editing the voice spectrogram to be edited according to the target voice attribute to obtain a target voice spectrogram.
A voice generating module 208, configured to generate a voice signal for output according to the target voice spectrogram.
Example III
Referring to fig. 12, a hardware architecture diagram of a computer device according to a third embodiment of the invention is shown. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack-mounted server, a blade server, a tower server, or a cabinet server (including a stand-alone server or a server cluster composed of multiple servers), or the like. As shown, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a speech generating system 20 communicatively coupled to each other via a system bus.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 2. Of course, the memory 21 may also include both internal storage units of the computer device 2 and external storage devices. In this embodiment, the memory 21 is typically used to store an operating system and various types of application software installed on the computer device 2, such as program codes of the speech generating system 20 of the second embodiment. Further, the memory 21 may be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data, for example, execute the speech generating system 20, to implement the speech generating method of the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, which network interface 23 is typically used for establishing a communication connection between the computer apparatus 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, or other wireless or wired network.
It is noted that fig. 12 only shows a computer device 2 having components 20-23, but it is understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.
In the present embodiment, the speech generating system 20 stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in the present embodiment) to complete the present invention.
For example, fig. 11 shows a schematic diagram of the program modules implementing the speech generating system 20 according to the second embodiment of the present invention, where the speech generating system 20 may be divided into a first obtaining module 200, an attribute extraction module 202, a second obtaining module 204, a voice editing module 206 and a voice generation module 208. A program module as referred to in the present invention is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing the execution of the speech generating system 20 in the computer device 2. The specific functions of the program modules 200-208 have been described in detail in the second embodiment and are not repeated here.
Example IV
The present embodiment also provides a computer-readable storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor, performs the corresponding functions. The computer-readable storage medium of the present embodiment is used in the speech generating system 20, and when executed by a processor, implements the speech generating method of the first embodiment.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and of course may also be implemented by hardware, but in many cases the former is the preferred implementation.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (9)

1. A method of speech generation, the method comprising:
acquiring user audio data and converting the user audio data into a user voice spectrogram;
extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
acquiring audio data to be edited, and converting the audio data to be edited into a voice spectrogram to be edited;
inputting the voice attribute of the user and the voice spectrogram to be edited into a target generator to generate a target voice spectrogram; and
generating a voice signal for output according to the target voice spectrogram;
the extracting the user voice attribute corresponding to the user audio data from the user voice spectrogram includes:
extracting voice attributes of the user voice spectrogram through the target generator to obtain the user voice attributes corresponding to the user audio data;
the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
2. The speech generating method of claim 1, wherein converting the user audio data into a user speech spectrogram comprises:
extracting user spectrum information of the user audio data;
generating a first waveform diagram corresponding to a time domain according to the user frequency spectrum information;
carrying out frame division processing on the first oscillogram to obtain a plurality of first single-frame oscillograms;
performing Fourier transform operation on each first single-frame waveform diagram to obtain a plurality of first single-frame frequency spectrograms, wherein the horizontal axis of each first single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each first single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each first single-frame frequency spectrogram to obtain a plurality of first one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the first single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the first single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of first one-dimensional gray scale amplitude graphs to obtain the user voice spectrogram.
3. The speech generation method of claim 1, wherein the method further comprises the training step of the GAN model:
Acquiring a sample spectrogram and a sample attribute label corresponding to the sample spectrogram, wherein the sample spectrogram comprises a voice spectrogram;
inputting the sample spectrogram and the sample attribute tag into a GAN model;
determining a sample attribute region to which the sample spectrogram belongs through the spatial attention network;
inputting a sample spectrogram and the sample attribute label in the sample attribute area into the attribute editing network to obtain a generated spectrogram corresponding to the sample spectrogram;
inputting the sample spectrogram and the generated spectrogram into a discriminator of the GAN model, and judging whether the generated spectrogram accords with the graph distribution of the voice spectrogram or not through a true and false classifier in the discriminator;
if the generated spectrogram accords with the graph distribution of the user voice spectrogram, predicting sample voice attributes of the voice spectrogram through an attribute classifier in the discriminator; and
comparing the attribute difference between the sample voice attribute and the sample attribute label, and adjusting the parameters of the GAN model according to the attribute difference to obtain a target GAN model.
4. The method of claim 1, wherein the obtaining audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited comprises:
Extracting spectral information to be edited of the audio data to be edited;
generating a second waveform diagram corresponding to a time domain according to the frequency spectrum information to be edited;
carrying out frame division processing on the second waveform diagram to obtain a plurality of second single-frame waveform diagrams;
performing Fourier transform operation on each second single-frame waveform diagram to obtain a plurality of second single-frame frequency spectrograms, wherein the horizontal axis of each second single-frame frequency spectrogram is used for representing frequency, and the vertical axis of each second single-frame frequency spectrogram is used for representing amplitude;
performing inversion operation and gray scale operation on each second single-frame frequency spectrogram to obtain a plurality of second one-dimensional gray scale amplitude charts, wherein the inversion operation is used for exchanging a horizontal axis and a vertical axis in the second single-frame frequency spectrogram, and the gray scale operation is used for representing the amplitude in the second single-frame frequency spectrogram after the inversion operation through gray scale values; and
synthesizing the plurality of second one-dimensional gray scale amplitude graphs to obtain a voice spectrogram to be edited.
5. The speech generating method of claim 1, wherein generating a target speech spectrogram from the user speech attribute and the speech spectrogram to be edited comprises:
acquiring a target voice attribute corresponding to the user voice attribute according to the user voice attribute and the mapping relation diagram; and
inputting the target voice attribute and the voice spectrogram to be edited into a target generator to obtain a target voice spectrogram.
6. The speech generating method of claim 5, wherein inputting the target speech attribute and the speech spectrogram to be edited into a target generator to obtain a target speech spectrogram comprises:
determining a target attribute area to which the voice spectrogram to be edited belongs through the spatial attention network;
inputting the voice spectrogram to be edited in the target attribute area and the target voice attribute into the attribute editing network to obtain the target voice spectrogram, wherein the target voice spectrogram is the voice spectrogram to be edited carrying the target voice attribute.
7. A speech generation system, comprising:
the first acquisition module is used for acquiring user audio data and converting the user audio data into a user voice spectrogram;
the attribute extraction module is used for extracting user voice attributes corresponding to the user audio data from the user voice spectrogram, wherein the user voice attributes comprise style attributes;
The second acquisition module is used for acquiring the audio data to be edited and converting the audio data to be edited into a voice spectrogram to be edited;
the voice editing acquisition module is used for inputting the user voice attribute and the voice spectrogram to be edited into a target generator to generate a target voice spectrogram; and
the voice generation module is used for generating a voice signal for output according to the target voice spectrogram;
the attribute extraction module is further configured to extract, by using the target generator, a voice attribute of the user voice spectrogram, so as to obtain the user voice attribute corresponding to the user audio data; the target generator is a generator in a pre-trained target GAN model, and comprises a spatial attention network and a property editing network, wherein the spatial attention network is used for determining a property area of a voice spectrogram, and the property editing network is used for carrying out voice property editing and voice property extraction on the voice spectrogram of the property area.
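For illustration only, the modules of claim 7 could be wired together roughly as follows. The generator and vocoder objects and the extract_attributes method are assumptions, and waveform_to_grayscale_spectrogram refers to the conversion helper sketched after claim 4.

    class SpeechGenerationSystem:
        """A sketch of the claim-7 module wiring; every dependency here is assumed."""

        def __init__(self, target_generator, vocoder):
            self.target_generator = target_generator    # attention + attribute editing generator
            self.vocoder = vocoder                      # turns a spectrogram back into a waveform

        def run(self, user_audio, audio_to_edit):
            user_spec = waveform_to_grayscale_spectrogram(user_audio)             # first acquisition module
            user_attribute = self.target_generator.extract_attributes(user_spec)  # attribute extraction module
            edit_spec = waveform_to_grayscale_spectrogram(audio_to_edit)          # second acquisition module
            target_spec = self.target_generator(edit_spec, user_attribute)        # voice editing module
            return self.vocoder(target_spec)                                      # voice generation module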
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the speech generating method according to any one of claims 1 to 6.
9. A computer-readable storage medium, in which a computer program is stored, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the speech generating method according to any one of claims 1 to 6.
CN202010052356.5A 2020-01-17 2020-01-17 Speech generation method, system and computer equipment Active CN111276119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010052356.5A CN111276119B (en) 2020-01-17 2020-01-17 Speech generation method, system and computer equipment

Publications (2)

Publication Number Publication Date
CN111276119A (en) 2020-06-12
CN111276119B (en) 2023-08-22

Family

ID=71001048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010052356.5A Active CN111276119B (en) 2020-01-17 2020-01-17 Speech generation method, system and computer equipment

Country Status (1)

Country Link
CN (1) CN111276119B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768756B (en) * 2020-06-24 2023-10-20 华人运通(上海)云计算科技有限公司 Information processing method, information processing device, vehicle and computer storage medium
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112699726B (en) * 2020-11-11 2023-04-07 中国科学院计算技术研究所数字经济产业研究院 Image enhancement method, genuine-fake commodity identification method and equipment
CN112562728B (en) * 2020-11-13 2024-06-18 百果园技术(新加坡)有限公司 Method for generating countermeasure network training, method and device for audio style migration
CN114299969B (en) * 2021-08-19 2024-06-11 腾讯科技(深圳)有限公司 Audio synthesis method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785823A (en) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 Phoneme synthesizing method and system
CN109817246A (en) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110189766A (en) * 2019-06-14 2019-08-30 西南科技大学 A kind of voice style transfer method neural network based
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion

Also Published As

Publication number Publication date
CN111276119A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN111276119B (en) Speech generation method, system and computer equipment
US10553201B2 (en) Method and apparatus for speech synthesis
CN110335587B (en) Speech synthesis method, system, terminal device and readable storage medium
CN107481717B (en) Acoustic model training method and system
US10255903B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN111444382B (en) Audio processing method and device, computer equipment and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN110930975B (en) Method and device for outputting information
WO2024055752A9 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN116092503B (en) Fake voice detection method, device, equipment and medium combining time domain and frequency domain
CN114400005A (en) Voice message generation method and device, computer equipment and storage medium
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN115294947A (en) Audio data processing method and device, electronic equipment and medium
CN111862931B (en) Voice generation method and device
CN113012706B (en) Data processing method and device and electronic equipment
CN117877517B (en) Method, device, equipment and medium for generating environmental sound based on antagonistic neural network
CN114141259A (en) Voice conversion method, device, equipment, storage medium and program product
CN118447820A (en) Voice conversion method, device, equipment and medium based on style
CN112750423A (en) Method, device and system for constructing personalized speech synthesis model and electronic equipment
CN118193713A (en) Knowledge question-answering method and device based on virtual digital expert
CN115762472A (en) Voice rhythm identification method, system, equipment and storage medium
CN115547291A (en) Speech synthesis method, apparatus, electronic device and storage medium
CN115985287A (en) Speech synthesis method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant