CN112397047A - Speech synthesis method, device, electronic equipment and readable storage medium - Google Patents

Speech synthesis method, device, electronic equipment and readable storage medium

Info

Publication number
CN112397047A
CN112397047A (application CN202011442571.2A)
Authority
CN
China
Prior art keywords
text
vector
standard
phoneme
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011442571.2A
Other languages
Chinese (zh)
Inventor
陈闽川
马骏
王少军
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011442571.2A priority Critical patent/CN112397047A/en
Publication of CN112397047A publication Critical patent/CN112397047A/en
Priority to PCT/CN2021/083824 priority patent/WO2022121176A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to speech synthesis technology, and discloses a speech synthesis method comprising the following steps: acquiring sample audio, and performing sound feature extraction, conversion, and vectorization on the sample audio to obtain a standard speech vector; when a text to be synthesized is received, performing phoneme conversion on the text to obtain a text phoneme sequence; performing vector conversion on the text phoneme sequence to obtain a text matrix; performing vector splicing on the standard speech vector and the text matrix to obtain a target matrix; extracting spectral features from the target matrix to obtain spectral feature information; and performing speech synthesis on the spectral feature information using a preset vocoder to obtain synthesized audio. The invention also relates to blockchain technology: the spectral feature information can be stored in a blockchain. The invention further provides a speech synthesis apparatus, an electronic device, and a readable storage medium. The invention can improve the flexibility of speech synthesis.

Description

Speech synthesis method, device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a speech synthesis method, apparatus, electronic device, and readable storage medium.
Background
With the development of artificial intelligence, speech synthesis has become an important component of it: speech synthesis can convert arbitrary text into standard, fluent speech in real time, in effect giving a machine an artificial mouth, so the technology is receiving more and more attention.
However, current speech synthesis methods can only synthesize speech of one fixed style or language from a given text. For example, a method that can only synthesize Beijing-accented Mandarin from a Chinese text cannot synthesize Sichuan-accented or Japanese-accented speech. Such methods cannot meet the demand for multi-style speech synthesis, so their flexibility is poor.
Disclosure of Invention
The invention provides a voice synthesis method, a voice synthesis device, electronic equipment and a computer readable storage medium, and mainly aims to improve the flexibility of voice synthesis.
In order to achieve the above object, the present invention provides a speech synthesis method, including:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
Optionally, the performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard speech vector includes:
carrying out sound feature extraction and conversion on the sample audio to obtain a target spectrogram;
and performing feature extraction on the target spectrogram by using a pre-constructed image classification model to obtain the standard voice vector.
Optionally, the performing sound feature extraction and conversion on the sample audio to obtain a target spectrogram includes:
resampling the sample audio to obtain a digital voice signal;
pre-emphasis is carried out on the digital voice signal to obtain a standard digital voice signal;
and performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram.
Optionally, the performing feature extraction on the target spectrogram by using a pre-constructed image classification model to obtain the standard speech vector includes:
acquiring the outputs of all nodes of a fully connected layer contained in the image classification model to obtain a target spectrogram feature value set;
and longitudinally combining the feature values in the target spectrogram feature value set according to the order of the nodes of the fully connected layer to obtain a standard speech vector.
Optionally, the performing feature conversion on the standard digital speech signal to obtain the target spectrogram includes:
and mapping the standard digital voice signal in a frequency domain by using a preset voice processing algorithm to obtain the target spectrogram.
Optionally, vector splicing is performed on the standard speech vector and the text matrix to obtain a target matrix, where the vector splicing includes:
calculating the phoneme frame length of each phoneme in the text phoneme sequence by using a preset algorithm model to obtain a phoneme frame length sequence;
converting the phoneme frame length sequence into a phoneme frame length vector;
transversely splicing the phoneme frame length vector and the text matrix to obtain a standard text matrix;
and longitudinally splicing the standard voice vector and each column of the standard text matrix to obtain the target matrix.
Optionally, the performing phoneme conversion on the text to be synthesized to obtain a text phoneme sequence includes:
performing punctuation deletion on the text to be synthesized to obtain a standard text;
and marking the phoneme corresponding to each character in the standard text by using a preset phonetic symbol rule to obtain the text phoneme sequence.
In order to solve the above problem, the present invention also provides a speech synthesis apparatus, comprising:
the audio processing module is used for acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
the text processing module is used for carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence when the text to be synthesized is received; performing vector conversion on the text phoneme sequence to obtain a text matrix; performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
the voice synthesis module is used for extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information; and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
and a processor executing the computer program stored in the memory to implement the speech synthesis method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the speech synthesis method described above.
The embodiment of the invention performs sound feature extraction, conversion, and vectorization on sample audio to obtain a standard speech vector. When a text to be synthesized is received, phoneme conversion is performed on it to obtain a text phoneme sequence; this removes the differences between the pronunciations of different kinds of characters and makes speech synthesis more flexible. Vector conversion is performed on the text phoneme sequence to obtain a text matrix, and the standard speech vector is spliced with the text matrix to obtain a target matrix, flexibly combining the features of the sample voice with those of the text to be synthesized and thereby ensuring flexible subsequent synthesis. Spectral features are then extracted from the target matrix to obtain spectral feature information, and a preset vocoder performs speech synthesis on that information to obtain synthesized audio. The speech synthesis method, apparatus, electronic device, and computer-readable storage medium provided by the embodiments of the invention therefore improve the flexibility of speech synthesis.
Drawings
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of obtaining a target spectrogram in a speech synthesis method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating a process of obtaining a standard speech vector in a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 5 is a schematic internal structural diagram of an electronic device implementing a speech synthesis method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a speech synthesis method. The execution subject of the speech synthesis method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the speech synthesis method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to fig. 1, which is a schematic flow diagram of a speech synthesis method according to an embodiment of the present invention, in an embodiment of the present invention, the speech synthesis method includes:
s1, obtaining a sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
in the embodiment of the present invention, the sample audio is the voice data of the target speaker to be obtained subsequently, such as: the subsequent text is synthesized into the speech of speaker a, then the sample audio is the number of speakers a's speech.
Further, in order to make the speech synthesis of the subsequent text more accurate, the invention performs feature extraction processing on the sample audio to obtain the standard speech vector.
Because raw speech data is large and difficult to process directly, sound feature extraction and conversion are performed on the sample audio to obtain a target spectrogram.
In detail, in the embodiment of the present invention, referring to fig. 2, the performing sound feature extraction and conversion on the sample audio to obtain a target spectrogram includes:
s11, resampling the sample audio to obtain a digital voice signal;
In the embodiment of the present invention, the sample audio is resampled to obtain the digital speech signal so that it can be processed as data; preferably, the embodiment of the present invention resamples the sample audio using an analog-to-digital converter.
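A minimal resampling sketch follows; the librosa library and the 16 kHz target rate are illustrative assumptions, since the patent does not name a specific tool or sampling rate.

```python
import librosa

def resample_audio(path: str, target_sr: int = 16000):
    # librosa loads the file and resamples it to target_sr in one call,
    # returning the digital speech signal as a float array.
    signal, sr = librosa.load(path, sr=target_sr)
    return signal, sr
```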
S12, pre-emphasizing the digital voice signal to obtain a standard digital voice signal;
in detail, the embodiment of the present invention performs the pre-emphasis operation by using the following formula:
y(t)=x(t)-μx(t-1)
where x(t) is the digital speech signal, t is time, y(t) is the standard digital speech signal, and μ is a preset adjustment value for the pre-emphasis operation; preferably, μ lies in the range [0.9, 1.0].
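The formula above translates directly into a one-line array operation; the sketch below assumes NumPy and a μ of 0.97, a common choice within the stated range.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    # y(t) = x(t) - mu * x(t - 1); the first sample has no predecessor
    # and is passed through unchanged.
    return np.append(x[0], x[1:] - mu * x[:-1])
```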
S13, performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram;
In the embodiment of the invention, the standard digital speech signal only reflects how the audio changes in the time domain and cannot reflect the signal's audio characteristics. To expose those characteristics and make them more intuitive and clear, feature conversion is performed on the standard digital speech signal.
In detail, in the embodiment of the present invention, performing feature conversion on the standard digital speech signal includes mapping the standard digital speech signal into the frequency domain using a preset sound processing algorithm to obtain the target spectrogram. Preferably, the sound processing algorithm in the embodiment of the present invention is a mel filtering algorithm.
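A sketch of this frequency-domain mapping using librosa's mel filter bank follows; librosa and the parameter values are assumptions, since the patent only specifies "a mel filtering algorithm".

```python
import librosa
import numpy as np

def to_mel_spectrogram(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Short-time Fourier transform followed by a mel filter bank.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    # Log compression makes the spectrogram easier to treat as an image.
    return librosa.power_to_db(mel)
```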
Further, to simplify the data and improve processing efficiency, the embodiment of the present invention vectorizes the target spectrogram: feature extraction is performed on the target spectrogram using a pre-constructed image classification model to obtain the standard speech vector. Preferably, in an embodiment of the present invention, the pre-constructed image classification model is a residual network model trained on a historical spectrogram set, where the historical spectrogram set is a collection of spectrograms of the same type as the target spectrogram but with different content.
In detail, in the embodiment of the present invention, referring to fig. 3, the extracting features of the target spectrogram by using the pre-constructed image classification model to obtain the standard speech vector includes:
s21, obtaining the output of all nodes of the full-link layer contained in the image classification model to obtain a target spectrogram feature value set;
for example: the total connection layer of the image classification model comprises 1000 nodes, a target spectrogram T is input into the image classification model, 1000 node output values are obtained, and a target spectrogram feature value set of the target spectrogram T is obtained, wherein the output of each node is one feature value of the target spectrogram T, so that the target spectrogram feature value set of the target spectrogram T has 1000 feature values in total.
S22, longitudinally combining the characteristic values in the target spectrogram characteristic value set according to the sequence of all the nodes of the full connection layer to obtain a standard voice vector;
For example, suppose the fully connected layer has 3 nodes, in order a first node, a second node, and a third node, and the target spectrogram feature value set of a target spectrogram A contains the 3 feature values 3, 5, and 1, where feature value 1 is the output of the first node, feature value 3 is the output of the second node, and feature value 5 is the output of the third node. Longitudinally combining the three feature values in node order yields the standard speech vector of target spectrogram A, the column vector (1, 3, 5)ᵀ.
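The sketch below illustrates this extraction with torchvision's resnet18 standing in for the pre-constructed residual network model; the specific architecture, the 1000-node output layer, and the image preprocessing are assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet18(num_classes=1000)  # final fc layer: 1000 nodes
model.eval()

def spectrogram_to_speech_vector(spectrogram_image: torch.Tensor) -> torch.Tensor:
    # spectrogram_image: a (1, 3, H, W) tensor rendered from the target
    # spectrogram. The model's output is exactly the fc layer's node outputs.
    with torch.no_grad():
        outputs = model(spectrogram_image)  # shape (1, 1000)
    # Stack the node outputs longitudinally into a column vector.
    return outputs.squeeze(0).unsqueeze(1)  # shape (1000, 1)
```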
S2, when receiving a text to be synthesized, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
in the embodiment of the present invention, the text to be synthesized is a text requiring speech synthesis, and phonemes of pronunciations of texts with different speeches may be represented by a general phonetic symbol rule.
In detail, in the embodiment of the present invention, performing phoneme conversion on the text to be synthesized to obtain the text phoneme sequence includes: deleting punctuation from the text to be synthesized to obtain a standard text; and labeling the phoneme corresponding to each character in the standard text using a preset phonetic-notation rule to obtain the text phoneme sequence. For example, with the International Phonetic Alphabet as the preset rule, the character "o" is labeled with the phoneme a, and the resulting text phoneme sequence is [a].
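A toy sketch of this conversion follows: punctuation is stripped with a regular expression, then each character is labeled through a lookup table. The two-entry lexicon is a hypothetical stand-in for a full phonetic-notation rule.

```python
import re

# Hypothetical character-to-phoneme lexicon; a real rule set (e.g. the IPA)
# would cover the whole character inventory.
PHONE_LEXICON = {"o": "a", "h": "h"}

def text_to_phonemes(text: str) -> list[str]:
    standard_text = re.sub(r"[^\w]", "", text)  # punctuation deletion
    return [PHONE_LEXICON[ch] for ch in standard_text if ch in PHONE_LEXICON]
```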
S3, carrying out vector conversion on the text phoneme sequence to obtain a text matrix;
In the embodiment of the invention, each phoneme in the text phoneme sequence is converted into a column vector using a one-hot encoding algorithm, and the columns together form the text matrix.
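A one-hot conversion sketch follows; the phoneme inventory passed in is an assumption, since the patent does not fix one.

```python
import numpy as np

def phonemes_to_matrix(phonemes: list[str], inventory: list[str]) -> np.ndarray:
    # Each phoneme becomes a one-hot column; the columns form the text matrix.
    matrix = np.zeros((len(inventory), len(phonemes)))
    for col, phoneme in enumerate(phonemes):
        matrix[inventory.index(phoneme), col] = 1.0
    return matrix  # shape: (inventory size, sequence length)
```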
S4, carrying out vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
In detail, in the embodiment of the present invention, to improve subsequent speech synthesis, speech alignment must also be determined for each phoneme in the text phoneme sequence, i.e., each phoneme's pronunciation duration, called its phoneme frame length. The embodiment of the present invention therefore calculates the phoneme frame length of each phoneme in the text phoneme sequence using a preset algorithm model to obtain a phoneme frame length sequence; the preset algorithm model in the embodiment of the present invention may be a DNN-HMM network model.
Further, in the embodiment of the present invention, the phoneme frame length sequence is converted into a phoneme frame length vector, that is, into a corresponding row vector, and the phoneme frame length vector and the text matrix are transversely spliced to obtain the standard text matrix. For example, if the phoneme frame length vector is a 1 × 4 row vector and the text matrix is a 5 × 4 matrix, appending the phoneme frame length vector as a sixth row of the text matrix yields a 6 × 4 standard text matrix.
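In NumPy, this transverse splice is a single vstack, as the sketch below shows; the frame length values are illustrative.

```python
import numpy as np

text_matrix = np.zeros((5, 4))            # 5 x 4 text matrix
frame_lengths = np.array([[3, 2, 4, 1]])  # 1 x 4 phoneme frame length vector
standard_text_matrix = np.vstack([text_matrix, frame_lengths])
print(standard_text_matrix.shape)         # (6, 4)
```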
In detail, in the embodiment of the present invention, the standard speech vector is longitudinally spliced with each column of the standard text matrix to obtain the target matrix. Continuing the running example, splicing the 3 × 1 standard speech vector onto each column of the 6 × 4 standard text matrix yields a 9 × 4 target matrix.
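The sketch below performs this longitudinal splice by tiling the speech vector across the matrix's columns; the dimensions follow the running example, and the vector's values are the ones derived for target spectrogram A above.

```python
import numpy as np

standard_text_matrix = np.zeros((6, 4))    # 6 x 4 from the previous step
speech_vector = np.array([[1], [3], [5]])  # 3 x 1 standard speech vector
# Repeat the vector once per column, then stack it above every column.
target_matrix = np.vstack([np.tile(speech_vector, (1, 4)),
                           standard_text_matrix])
print(target_matrix.shape)                 # (9, 4)
```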
S5, extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
To carry out the subsequent speech synthesis, the embodiment of the present invention must also determine the spectral features of the target matrix; the spectral features may be a mel spectrogram.
In detail, in the embodiment of the present invention, the trained acoustic model is used to perform spectrum feature extraction on the target matrix, so as to obtain the spectrum feature extraction. Preferably, the acoustic model may be a transform model.
Further, before extracting the spectral features of the target matrix with the trained acoustic model, the method further includes: acquiring a historical text matrix set; labeling each historical text matrix in the historical text matrix set with spectral feature information to obtain a training set; and training the acoustic model with the training set until the model converges, yielding the trained acoustic model. The historical text matrix set is a collection of historical text matrices, each of which is the target matrix corresponding to a text different from the text to be synthesized.
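A minimal training-loop sketch follows; the toy regression model, dimensions, and loss are assumptions, since the patent only requires training on labeled historical text matrices until convergence.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the acoustic model (the patent suggests a
# Transformer); maps 9-dimensional target-matrix columns to 80-bin spectra.
acoustic_model = nn.Sequential(nn.Linear(9, 128), nn.ReLU(), nn.Linear(128, 80))
optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(target_matrix: torch.Tensor, spectrum_label: torch.Tensor) -> float:
    # target_matrix: (frames, 9) columns of a historical text matrix;
    # spectrum_label: (frames, 80) annotated spectral feature information.
    optimizer.zero_grad()
    loss = loss_fn(acoustic_model(target_matrix), spectrum_label)
    loss.backward()
    optimizer.step()
    return loss.item()
```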
In another embodiment of the present invention, in order to ensure the privacy of data, the spectrum feature information may be stored in a block link point.
And S6, performing voice synthesis on the spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
In detail, in the embodiment of the present invention, the spectral feature information is input to a preset vocoder, so as to obtain the synthesized audio.
Preferably, the vocoder is a WORLD vocoder.
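A synthesis sketch using the pyworld binding of the WORLD vocoder follows; note, as an assumption, that WORLD synthesizes from a fundamental frequency contour, spectral envelope, and aperiodicity rather than from a raw mel spectrogram, so the spectral feature information would first be converted into these three arrays.

```python
import numpy as np
import pyworld

def world_synthesize(f0: np.ndarray, spectral_envelope: np.ndarray,
                     aperiodicity: np.ndarray, fs: int = 16000) -> np.ndarray:
    # pyworld expects float64 arrays: f0 has shape (frames,), the other
    # two have shape (frames, fft_size // 2 + 1).
    return pyworld.synthesize(f0.astype(np.float64),
                              spectral_envelope.astype(np.float64),
                              aperiodicity.astype(np.float64), fs)
```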
Fig. 4 is a functional block diagram of the speech synthesis apparatus according to the present invention.
The speech synthesis apparatus 100 of the present invention can be installed in an electronic device. According to the implemented functions, the speech synthesis apparatus may include an audio processing module 101, a text processing module 102, and a speech synthesis module 103. A module, which may also be referred to as a unit, is a series of computer program segments that can be executed by a processor of an electronic device to perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the audio processing module 101 is configured to obtain a sample audio, perform sound feature extraction conversion and vectorization processing on the sample audio, and obtain a standard speech vector.
In the embodiment of the present invention, the sample audio is speech data of the target speaker whose voice is to be produced. For example, if subsequent text is to be synthesized into the voice of speaker A, the sample audio is speech data of speaker A.
Further, in order to make the speech synthesis of the subsequent text more accurate, the audio processing module 101 performs feature extraction processing on the sample audio to obtain the standard speech vector.
Because raw speech data is large and difficult to process directly, the audio processing module 101 performs sound feature extraction and conversion on the sample audio to obtain a target spectrogram.
In detail, in the embodiment of the present invention, the audio processing module 101 performs sound feature extraction and conversion on the sample audio to obtain a target spectrogram as follows:
resampling the sample audio to obtain a digital voice signal;
In the embodiment of the present invention, the sample audio is resampled to obtain the digital speech signal so that it can be processed as data; preferably, the embodiment of the present invention resamples the sample audio using an analog-to-digital converter.
Pre-emphasis is carried out on the digital voice signal to obtain a standard digital voice signal;
in detail, the embodiment of the present invention performs the pre-emphasis operation by using the following formula:
y(t)=x(t)-μx(t-1)
where x(t) is the digital speech signal, t is time, y(t) is the standard digital speech signal, and μ is a preset adjustment value for the pre-emphasis operation; preferably, μ lies in the range [0.9, 1.0].
Performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram;
In the embodiment of the invention, the standard digital speech signal only reflects how the audio changes in the time domain and cannot reflect the signal's audio characteristics. To expose those characteristics and make them more intuitive and clear, feature conversion is performed on the standard digital speech signal.
In detail, in the embodiment of the present invention, the audio processing module 101 performs feature conversion on the standard digital speech signal by mapping it into the frequency domain using a preset sound processing algorithm to obtain the target spectrogram. Preferably, the sound processing algorithm in the embodiment of the present invention is a mel filtering algorithm.
Further, to simplify the data and improve processing efficiency, the audio processing module 101 of the embodiment of the present invention vectorizes the target spectrogram: feature extraction is performed on the target spectrogram using a pre-constructed image classification model to obtain the standard speech vector. Preferably, in an embodiment of the present invention, the pre-constructed image classification model is a residual network model trained on a historical spectrogram set, where the historical spectrogram set is a collection of spectrograms of the same type as the target spectrogram but with different content.
In detail, in the embodiment of the present invention, the audio processing module 101 performs feature extraction on the target spectrogram to obtain the standard speech vector as follows:
obtaining the outputs of all nodes of the fully connected layer contained in the image classification model to obtain a target spectrogram feature value set;
For example, if the fully connected layer of the image classification model contains 1000 nodes, inputting a target spectrogram T into the model yields 1000 node output values, which form the target spectrogram feature value set of T; each node's output is one feature value of T, so the set contains 1000 feature values in total.
longitudinally combining the feature values in the target spectrogram feature value set according to the order of the nodes of the fully connected layer to obtain a standard speech vector;
For example, suppose the fully connected layer has 3 nodes, in order a first node, a second node, and a third node, and the target spectrogram feature value set of a target spectrogram A contains the 3 feature values 3, 5, and 1, where feature value 1 is the output of the first node, feature value 3 is the output of the second node, and feature value 5 is the output of the third node. Longitudinally combining the three feature values in node order yields the standard speech vector of target spectrogram A, the column vector (1, 3, 5)ᵀ.
The text processing module 102 is configured to, when receiving a text to be synthesized, perform phoneme conversion on the text to be synthesized to obtain a text phoneme sequence; performing vector conversion on the text phoneme sequence to obtain a text matrix; and carrying out vector splicing on the standard voice vector and the text matrix to obtain a target matrix.
In the embodiment of the present invention, the text to be synthesized is the text requiring speech synthesis; the phonemes of the pronunciations of texts in different languages can be represented with a common phonetic-notation rule.
In detail, in this embodiment of the present invention, the text processing module 102 performs phoneme conversion on the text to be synthesized to obtain the text phoneme sequence by: deleting punctuation from the text to be synthesized to obtain a standard text; and labeling the phoneme corresponding to each character in the standard text using a preset phonetic-notation rule to obtain the text phoneme sequence. For example, with the International Phonetic Alphabet as the preset rule, the character "o" is labeled with the phoneme a, and the resulting text phoneme sequence is [a].
In this embodiment of the present invention, the text processing module 102 converts each phoneme in the text phoneme sequence into a column vector using a one-hot encoding algorithm, and the columns together form the text matrix.
In detail, in this embodiment of the present invention, to improve subsequent speech synthesis, speech alignment must also be determined for each phoneme in the text phoneme sequence, i.e., each phoneme's pronunciation duration, called its phoneme frame length. The text processing module 102 therefore calculates the phoneme frame length of each phoneme in the text phoneme sequence using a preset algorithm model to obtain a phoneme frame length sequence; the preset algorithm model may be a DNN-HMM network model.
Further, in this embodiment of the present invention, the text processing module 102 converts the phoneme frame length sequence into a phoneme frame length vector, that is, into a corresponding row vector, and transversely splices the phoneme frame length vector and the text matrix to obtain the standard text matrix. For example, if the phoneme frame length vector is a 1 × 4 row vector and the text matrix is a 5 × 4 matrix, appending the phoneme frame length vector as a sixth row of the text matrix yields a 6 × 4 standard text matrix.
In detail, in the embodiment of the present invention, the text processing module 102 longitudinally splices the standard speech vector with each column of the standard text matrix to obtain the target matrix. Continuing the running example, splicing the 3 × 1 standard speech vector onto each column of the 6 × 4 standard text matrix yields a 9 × 4 target matrix.
The voice synthesis module 103 is configured to perform spectrum feature extraction on the target matrix to obtain spectrum feature information; and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
In order to further perform speech synthesis, the embodiment of the present invention further needs to determine a spectral feature of the target matrix, where the spectral feature may be Mel-frequency spectrum.
In detail, in the embodiment of the present invention, the trained acoustic model is used to perform spectrum feature extraction on the target matrix, so as to obtain the spectrum feature extraction. Preferably, the acoustic model may be a transform model.
Further, before the speech synthesis module 103 extracts the spectral features of the target matrix with the trained acoustic model, the method further includes: acquiring a historical text matrix set; labeling each historical text matrix in the historical text matrix set with spectral feature information to obtain a training set; and training the acoustic model with the training set until the model converges, yielding the trained acoustic model. The historical text matrix set is a collection of historical text matrices, each of which is the target matrix corresponding to a text different from the text to be synthesized.
In another embodiment of the present invention, in order to ensure the privacy of data, the spectrum feature information may be stored in a block link point.
In detail, in the embodiment of the present invention, the speech synthesis module 103 inputs the spectrum feature information to a preset vocoder to obtain the synthesized audio.
Preferably, the vocoder is a WORLD vocoder.
Fig. 5 is a schematic structural diagram of an electronic device implementing the speech synthesis method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a speech synthesis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of a speech synthesis program, but also to temporarily store data that has been output or is to be output.
The processor 10 may in some embodiments be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device: it connects the various components of the electronic device using various interfaces and lines, and executes the various functions of the electronic device 1 and processes its data by running or executing the programs or modules (e.g., the speech synthesis program) stored in the memory 11 and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may include a display and an input unit such as a keyboard, and optionally a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech synthesis program 12 stored in the memory 11 of the electronic device 1 is a combination of computer programs that, when executed by the processor 10, can implement:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
Specifically, the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of an electronic device, the computer program may implement:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
2. The speech synthesis method of claim 1, wherein the performing the acoustic feature extraction conversion and the vectorization process on the sample audio to obtain a standard speech vector comprises:
carrying out sound feature extraction and conversion on the sample audio to obtain a target spectrogram;
and performing feature extraction on the target spectrogram by using a pre-constructed image classification model to obtain the standard voice vector.
3. The speech synthesis method of claim 2, wherein the performing acoustic feature extraction conversion on the sample audio to obtain a target spectrogram comprises:
resampling the sample audio to obtain a digital voice signal;
pre-emphasis is carried out on the digital voice signal to obtain a standard digital voice signal;
and performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram.
4. The speech synthesis method of claim 2, wherein the extracting features of the target spectrogram by using the pre-constructed image classification model to obtain the standard speech vector comprises:
acquiring the outputs of all nodes of a fully connected layer contained in the image classification model to obtain a target spectrogram feature value set;
and longitudinally combining the feature values in the target spectrogram feature value set according to the order of the nodes of the fully connected layer to obtain a standard speech vector.
5. The speech synthesis method of claim 3, wherein said performing feature conversion on said standard digital speech signal to obtain said target spectrogram comprises:
and mapping the standard digital voice signal in a frequency domain by using a preset voice processing algorithm to obtain the target spectrogram.
6. The speech synthesis method of claim 1, wherein the vector-splicing the standard speech vector with the text matrix to obtain a target matrix comprises:
calculating the phoneme frame length of each phoneme in the text phoneme sequence by using a preset algorithm model to obtain a phoneme frame length sequence;
converting the phoneme frame length sequence into a phoneme frame length vector;
transversely splicing the phoneme frame length vector and the text matrix to obtain a standard text matrix;
and longitudinally splicing the standard voice vector and each column of the standard text matrix to obtain the target matrix.
7. The speech synthesis method according to any one of claims 1 to 6, wherein the performing phoneme conversion on the text to be synthesized to obtain a text phoneme sequence comprises:
performing punctuation deletion on the text to be synthesized to obtain a standard text;
and marking the phoneme corresponding to each character in the standard text by using a preset phonetic symbol rule to obtain the text phoneme sequence.
8. A speech synthesis apparatus, comprising:
the audio processing module is used for acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
the text processing module is used for carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence when the text to be synthesized is received; performing vector conversion on the text phoneme sequence to obtain a text matrix; performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
the voice synthesis module is used for extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information; and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the speech synthesis method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a speech synthesis method according to any one of claims 1 to 7.
CN202011442571.2A 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium Pending CN112397047A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011442571.2A CN112397047A (en) 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium
PCT/CN2021/083824 WO2022121176A1 (en) 2020-12-11 2021-03-30 Speech synthesis method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011442571.2A CN112397047A (en) 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112397047A true CN112397047A (en) 2021-02-23

Family

ID=74625646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011442571.2A Pending CN112397047A (en) 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN112397047A (en)
WO (1) WO2022121176A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113096625A (en) * 2021-03-24 2021-07-09 平安科技(深圳)有限公司 Multi-person Buddha music generation method, device, equipment and storage medium
CN113327578A (en) * 2021-06-10 2021-08-31 平安科技(深圳)有限公司 Acoustic model training method and device, terminal device and storage medium
CN113436608A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Double-stream voice conversion method, device, equipment and storage medium
WO2022121176A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN114783406A (en) * 2022-06-16 2022-07-22 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769481B2 (en) 2021-10-07 2023-09-26 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4966048B2 (en) * 2007-02-20 2012-07-04 株式会社東芝 Voice quality conversion device and speech synthesis device
US10186252B1 (en) * 2015-08-13 2019-01-22 Oben, Inc. Text to speech synthesis using deep neural network with constant unit length spectrogram
CN111161702B (en) * 2019-12-23 2022-08-26 爱驰汽车有限公司 Personalized speech synthesis method and device, electronic equipment and storage medium
CN112002305B (en) * 2020-07-29 2024-06-18 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN112397047A (en) * 2020-12-11 2021-02-23 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and readable storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121176A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN113096625A (en) * 2021-03-24 2021-07-09 平安科技(深圳)有限公司 Multi-person Buddha music generation method, device, equipment and storage medium
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113327578A (en) * 2021-06-10 2021-08-31 平安科技(深圳)有限公司 Acoustic model training method and device, terminal device and storage medium
CN113327578B (en) * 2021-06-10 2024-02-02 平安科技(深圳)有限公司 Acoustic model training method and device, terminal equipment and storage medium
CN113436608A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Double-stream voice conversion method, device, equipment and storage medium
CN113436608B (en) * 2021-06-25 2023-11-28 平安科技(深圳)有限公司 Double-flow voice conversion method, device, equipment and storage medium
CN114783406A (en) * 2022-06-16 2022-07-22 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium
CN116705058B (en) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
WO2022121176A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN109686361B (en) Speech synthesis method, device, computing equipment and computer storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111862937A (en) Singing voice synthesis method, singing voice synthesis device and computer readable storage medium
CN113345431B (en) Cross-language voice conversion method, device, equipment and medium
CN112820269B (en) Text-to-speech method and device, electronic equipment and storage medium
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN113205814B (en) Voice data labeling method and device, electronic equipment and storage medium
CN112509554A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN112233700A (en) Audio-based user state identification method and device and storage medium
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN113887200A (en) Text variable-length error correction method and device, electronic equipment and storage medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113707124A (en) Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN112201253A (en) Character marking method and device, electronic equipment and computer readable storage medium
CN112489628A (en) Voice data selection method and device, electronic equipment and storage medium
CN116564322A (en) Voice conversion method, device, equipment and storage medium
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
CN115631748A (en) Emotion recognition method and device based on voice conversation, electronic equipment and medium
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language
CN113990286A (en) Speech synthesis method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination