CN112397047A - Speech synthesis method, device, electronic equipment and readable storage medium - Google Patents

Speech synthesis method, device, electronic equipment and readable storage medium

Info

Publication number
CN112397047A
CN112397047A (application CN202011442571.2A)
Authority
CN
China
Prior art keywords
text
vector
standard
phoneme
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011442571.2A
Other languages
Chinese (zh)
Inventor
陈闽川
马骏
王少军
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011442571.2A priority Critical patent/CN112397047A/en
Publication of CN112397047A publication Critical patent/CN112397047A/en
Priority to PCT/CN2021/083824 priority patent/WO2022121176A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to speech synthesis technology, and discloses a speech synthesis method comprising the following steps: acquiring sample audio, and performing sound feature extraction, conversion, and vectorization on the sample audio to obtain a standard speech vector; when a text to be synthesized is received, performing phoneme conversion on the text to obtain a text phoneme sequence; performing vector conversion on the text phoneme sequence to obtain a text matrix; performing vector splicing on the standard speech vector and the text matrix to obtain a target matrix; extracting spectral features from the target matrix to obtain spectral feature information; and performing speech synthesis on the spectral feature information using a preset vocoder to obtain synthesized audio. The invention also relates to blockchain technology: the spectral feature information can be stored in a blockchain. The invention further provides a speech synthesis apparatus, an electronic device, and a readable storage medium. The invention can improve the flexibility of speech synthesis.

Description

Speech synthesis method, device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a speech synthesis method, apparatus, electronic device, and readable storage medium.
Background
With the development of artificial intelligence, speech synthesis has become an important component of it: speech synthesis can convert arbitrary text into standard, fluent speech in real time, in effect giving a machine an artificial mouth, so the technology is receiving more and more attention.
However, current speech synthesis methods can only synthesize speech of one fixed style or language from a given text. For example, a method that can only synthesize Beijing-accented Mandarin from a Chinese text cannot synthesize Sichuan-accented or Japanese-accented speech. Such methods cannot meet the demand for multi-style speech synthesis, so their flexibility is poor.
Disclosure of Invention
The invention provides a voice synthesis method, a voice synthesis device, electronic equipment and a computer readable storage medium, and mainly aims to improve the flexibility of voice synthesis.
In order to achieve the above object, the present invention provides a speech synthesis method, including:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
Optionally, the performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard speech vector includes:
carrying out sound feature extraction and conversion on the sample audio to obtain a target spectrogram;
and performing feature extraction on the target spectrogram by using a pre-constructed image classification model to obtain the standard voice vector.
Optionally, the performing sound feature extraction and conversion on the sample audio to obtain a target spectrogram includes:
resampling the sample audio to obtain a digital voice signal;
pre-emphasis is carried out on the digital voice signal to obtain a standard digital voice signal;
and performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram.
Optionally, the performing feature extraction on the target spectrogram by using a pre-constructed image classification model to obtain the standard speech vector includes:
acquiring the outputs of all nodes of a fully connected layer contained in the image classification model to obtain a target spectrogram feature value set;
and longitudinally combining the feature values in the target spectrogram feature value set according to the order of the nodes of the fully connected layer to obtain a standard speech vector.
Optionally, the performing feature conversion on the standard digital speech signal to obtain the target spectrogram includes:
and mapping the standard digital voice signal in a frequency domain by using a preset voice processing algorithm to obtain the target spectrogram.
Optionally, vector splicing is performed on the standard speech vector and the text matrix to obtain a target matrix, where the vector splicing includes:
calculating the phoneme frame length of each phoneme in the text phoneme sequence by using a preset algorithm model to obtain a phoneme frame length sequence;
converting the phoneme frame length sequence into a phoneme frame length vector;
transversely splicing the phoneme frame length vector and the text matrix to obtain a standard text matrix;
and longitudinally splicing the standard voice vector and each column of the standard text matrix to obtain the target matrix.
Optionally, the performing phoneme conversion on the text to be synthesized to obtain a text phoneme sequence includes:
performing punctuation deletion on the text to be synthesized to obtain a standard text;
and marking the phoneme corresponding to each character in the standard text by using a preset phonetic symbol rule to obtain the text phoneme sequence.
In order to solve the above problem, the present invention also provides a speech synthesis apparatus, comprising:
the audio processing module is used for acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
the text processing module is used for carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence when the text to be synthesized is received; performing vector conversion on the text phoneme sequence to obtain a text matrix; performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
the voice synthesis module is used for extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information; and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
and a processor executing the computer program stored in the memory to implement the speech synthesis method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the speech synthesis method described above.
The embodiment of the invention performs sound feature extraction, conversion, and vectorization on sample audio to obtain a standard speech vector. When a text to be synthesized is received, phoneme conversion is performed on it to obtain a text phoneme sequence; this removes the differences between the pronunciations of different kinds of characters and makes speech synthesis more flexible. Vector conversion is performed on the text phoneme sequence to obtain a text matrix, and the standard speech vector is spliced with the text matrix to obtain a target matrix, flexibly combining the features of the sample voice with those of the text to be synthesized and thereby ensuring flexible subsequent synthesis. Spectral features are then extracted from the target matrix to obtain spectral feature information, and a preset vocoder performs speech synthesis on that information to obtain synthesized audio. The speech synthesis method, apparatus, electronic device, and computer-readable storage medium provided by the embodiments of the invention therefore improve the flexibility of speech synthesis.
Drawings
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of obtaining a target spectrogram in a speech synthesis method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart illustrating a process of obtaining a standard speech vector in a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 5 is a schematic internal structural diagram of an electronic device implementing a speech synthesis method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a speech synthesis method. The execution subject of the speech synthesis method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the speech synthesis method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to fig. 1, which is a schematic flow diagram of a speech synthesis method according to an embodiment of the present invention, in an embodiment of the present invention, the speech synthesis method includes:
s1, obtaining a sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
in the embodiment of the present invention, the sample audio is the voice data of the target speaker to be obtained subsequently, such as: the subsequent text is synthesized into the speech of speaker a, then the sample audio is the number of speakers a's speech.
Further, in order to make the speech synthesis of the subsequent text more accurate, the invention performs feature extraction processing on the sample audio to obtain the standard speech vector.
Because raw speech data is large and difficult to process directly, sound feature extraction and conversion are performed on the sample audio to obtain a target spectrogram.
In detail, in the embodiment of the present invention, referring to fig. 2, the performing sound feature extraction and conversion on the sample audio to obtain a target spectrogram includes:
s11, resampling the sample audio to obtain a digital voice signal;
In the embodiment of the present invention, the sample audio is resampled to obtain the digital speech signal so that it can be processed as data; preferably, the embodiment of the present invention resamples the sample audio using an analog-to-digital converter.
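A minimal resampling sketch follows; the librosa library and the 16 kHz target rate are illustrative assumptions, since the patent does not name a specific tool or sampling rate.

```python
import librosa

def resample_audio(path: str, target_sr: int = 16000):
    # librosa loads the file and resamples it to target_sr in one call,
    # returning the digital speech signal as a float array.
    signal, sr = librosa.load(path, sr=target_sr)
    return signal, sr
```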
S12, pre-emphasizing the digital voice signal to obtain a standard digital voice signal;
in detail, the embodiment of the present invention performs the pre-emphasis operation by using the following formula:
y(t)=x(t)-μx(t-1)
where x(t) is the digital speech signal, t is time, y(t) is the standard digital speech signal, and μ is a preset adjustment value for the pre-emphasis operation; preferably, μ lies in the range [0.9, 1.0].
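The formula above translates directly into a one-line array operation; the sketch below assumes NumPy and a μ of 0.97, a common choice within the stated range.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    # y(t) = x(t) - mu * x(t - 1); the first sample has no predecessor
    # and is passed through unchanged.
    return np.append(x[0], x[1:] - mu * x[:-1])
```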
S13, performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram;
In the embodiment of the invention, the standard digital speech signal only reflects how the audio changes in the time domain and cannot reflect the signal's audio characteristics. To expose those characteristics and make them more intuitive and clear, feature conversion is performed on the standard digital speech signal.
In detail, in the embodiment of the present invention, performing feature conversion on the standard digital speech signal includes mapping the standard digital speech signal into the frequency domain using a preset sound processing algorithm to obtain the target spectrogram. Preferably, the sound processing algorithm in the embodiment of the present invention is a mel filtering algorithm.
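A sketch of this frequency-domain mapping using librosa's mel filter bank follows; librosa and the parameter values are assumptions, since the patent only specifies "a mel filtering algorithm".

```python
import librosa
import numpy as np

def to_mel_spectrogram(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Short-time Fourier transform followed by a mel filter bank.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    # Log compression makes the spectrogram easier to treat as an image.
    return librosa.power_to_db(mel)
```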
Further, to simplify the data and improve processing efficiency, the embodiment of the present invention vectorizes the target spectrogram: feature extraction is performed on the target spectrogram using a pre-constructed image classification model to obtain the standard speech vector. Preferably, in an embodiment of the present invention, the pre-constructed image classification model is a residual network model trained on a historical spectrogram set, where the historical spectrogram set is a collection of spectrograms of the same type as the target spectrogram but with different content.
In detail, in the embodiment of the present invention, referring to fig. 3, the extracting features of the target spectrogram by using the pre-constructed image classification model to obtain the standard speech vector includes:
s21, obtaining the output of all nodes of the full-link layer contained in the image classification model to obtain a target spectrogram feature value set;
for example: the total connection layer of the image classification model comprises 1000 nodes, a target spectrogram T is input into the image classification model, 1000 node output values are obtained, and a target spectrogram feature value set of the target spectrogram T is obtained, wherein the output of each node is one feature value of the target spectrogram T, so that the target spectrogram feature value set of the target spectrogram T has 1000 feature values in total.
S22, longitudinally combining the characteristic values in the target spectrogram characteristic value set according to the sequence of all the nodes of the full connection layer to obtain a standard voice vector;
For example, suppose the fully connected layer has 3 nodes, in order a first node, a second node, and a third node, and the target spectrogram feature value set of a target spectrogram A contains the 3 feature values 3, 5, and 1, where feature value 1 is the output of the first node, feature value 3 is the output of the second node, and feature value 5 is the output of the third node. Longitudinally combining the three feature values in node order yields the standard speech vector of target spectrogram A, the column vector (1, 3, 5)ᵀ.
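The sketch below illustrates this extraction with torchvision's resnet18 standing in for the pre-constructed residual network model; the specific architecture, the 1000-node output layer, and the image preprocessing are assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet18(num_classes=1000)  # final fc layer: 1000 nodes
model.eval()

def spectrogram_to_speech_vector(spectrogram_image: torch.Tensor) -> torch.Tensor:
    # spectrogram_image: a (1, 3, H, W) tensor rendered from the target
    # spectrogram. The model's output is exactly the fc layer's node outputs.
    with torch.no_grad():
        outputs = model(spectrogram_image)  # shape (1, 1000)
    # Stack the node outputs longitudinally into a column vector.
    return outputs.squeeze(0).unsqueeze(1)  # shape (1000, 1)
```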
S2, when receiving a text to be synthesized, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
in the embodiment of the present invention, the text to be synthesized is a text requiring speech synthesis, and phonemes of pronunciations of texts with different speeches may be represented by a general phonetic symbol rule.
In detail, in the embodiment of the present invention, performing phoneme conversion on the text to be synthesized to obtain the text phoneme sequence includes: deleting punctuation from the text to be synthesized to obtain a standard text; and labeling the phoneme corresponding to each character in the standard text using a preset phonetic-notation rule to obtain the text phoneme sequence. For example, with the International Phonetic Alphabet as the preset rule, the character "o" is labeled with the phoneme a, and the resulting text phoneme sequence is [a].
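A toy sketch of this conversion follows: punctuation is stripped with a regular expression, then each character is labeled through a lookup table. The two-entry lexicon is a hypothetical stand-in for a full phonetic-notation rule.

```python
import re

# Hypothetical character-to-phoneme lexicon; a real rule set (e.g. the IPA)
# would cover the whole character inventory.
PHONE_LEXICON = {"o": "a", "h": "h"}

def text_to_phonemes(text: str) -> list[str]:
    standard_text = re.sub(r"[^\w]", "", text)  # punctuation deletion
    return [PHONE_LEXICON[ch] for ch in standard_text if ch in PHONE_LEXICON]
```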
S3, carrying out vector conversion on the text phoneme sequence to obtain a text matrix;
In the embodiment of the invention, each phoneme in the text phoneme sequence is converted into a column vector using a one-hot encoding algorithm, and the columns together form the text matrix.
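A one-hot conversion sketch follows; the phoneme inventory passed in is an assumption, since the patent does not fix one.

```python
import numpy as np

def phonemes_to_matrix(phonemes: list[str], inventory: list[str]) -> np.ndarray:
    # Each phoneme becomes a one-hot column; the columns form the text matrix.
    matrix = np.zeros((len(inventory), len(phonemes)))
    for col, phoneme in enumerate(phonemes):
        matrix[inventory.index(phoneme), col] = 1.0
    return matrix  # shape: (inventory size, sequence length)
```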
S4, carrying out vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
In detail, in the embodiment of the present invention, to improve subsequent speech synthesis, speech alignment must also be determined for each phoneme in the text phoneme sequence, i.e., each phoneme's pronunciation duration, called its phoneme frame length. The embodiment of the present invention therefore calculates the phoneme frame length of each phoneme in the text phoneme sequence using a preset algorithm model to obtain a phoneme frame length sequence; the preset algorithm model in the embodiment of the present invention may be a DNN-HMM network model.
Further, in the embodiment of the present invention, the phoneme frame length sequence is converted into a phoneme frame length vector, that is, into a corresponding row vector, and the phoneme frame length vector and the text matrix are transversely spliced to obtain the standard text matrix. For example, if the phoneme frame length vector is a 1 × 4 row vector and the text matrix is a 5 × 4 matrix, appending the phoneme frame length vector as a sixth row of the text matrix yields a 6 × 4 standard text matrix.
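In NumPy, this transverse splice is a single vstack, as the sketch below shows; the frame length values are illustrative.

```python
import numpy as np

text_matrix = np.zeros((5, 4))            # 5 x 4 text matrix
frame_lengths = np.array([[3, 2, 4, 1]])  # 1 x 4 phoneme frame length vector
standard_text_matrix = np.vstack([text_matrix, frame_lengths])
print(standard_text_matrix.shape)         # (6, 4)
```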
In detail, in the embodiment of the present invention, the standard speech vector is longitudinally spliced with each column of the standard text matrix to obtain the target matrix. Continuing the running example, splicing the 3 × 1 standard speech vector onto each column of the 6 × 4 standard text matrix yields a 9 × 4 target matrix.
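The sketch below performs this longitudinal splice by tiling the speech vector across the matrix's columns; the dimensions follow the running example, and the vector's values are the ones derived for target spectrogram A above.

```python
import numpy as np

standard_text_matrix = np.zeros((6, 4))    # 6 x 4 from the previous step
speech_vector = np.array([[1], [3], [5]])  # 3 x 1 standard speech vector
# Repeat the vector once per column, then stack it above every column.
target_matrix = np.vstack([np.tile(speech_vector, (1, 4)),
                           standard_text_matrix])
print(target_matrix.shape)                 # (9, 4)
```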
S5, extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
To carry out the subsequent speech synthesis, the embodiment of the present invention must also determine the spectral features of the target matrix; the spectral features may be a mel spectrogram.
In detail, in the embodiment of the present invention, the trained acoustic model is used to perform spectrum feature extraction on the target matrix, so as to obtain the spectrum feature extraction. Preferably, the acoustic model may be a transform model.
Further, before extracting the spectral features of the target matrix with the trained acoustic model, the method further includes: acquiring a historical text matrix set; labeling each historical text matrix in the historical text matrix set with spectral feature information to obtain a training set; and training the acoustic model with the training set until the model converges, yielding the trained acoustic model. The historical text matrix set is a collection of historical text matrices, each of which is the target matrix corresponding to a text different from the text to be synthesized.
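A minimal training-loop sketch follows; the toy regression model, dimensions, and loss are assumptions, since the patent only requires training on labeled historical text matrices until convergence.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the acoustic model (the patent suggests a
# Transformer); maps 9-dimensional target-matrix columns to 80-bin spectra.
acoustic_model = nn.Sequential(nn.Linear(9, 128), nn.ReLU(), nn.Linear(128, 80))
optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(target_matrix: torch.Tensor, spectrum_label: torch.Tensor) -> float:
    # target_matrix: (frames, 9) columns of a historical text matrix;
    # spectrum_label: (frames, 80) annotated spectral feature information.
    optimizer.zero_grad()
    loss = loss_fn(acoustic_model(target_matrix), spectrum_label)
    loss.backward()
    optimizer.step()
    return loss.item()
```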
In another embodiment of the present invention, in order to ensure the privacy of data, the spectrum feature information may be stored in a block link point.
And S6, performing voice synthesis on the spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
In detail, in the embodiment of the present invention, the spectral feature information is input to a preset vocoder, so as to obtain the synthesized audio.
Preferably, the vocoder is a WORLD vocoder.
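A synthesis sketch using the pyworld binding of the WORLD vocoder follows; note, as an assumption, that WORLD synthesizes from a fundamental frequency contour, spectral envelope, and aperiodicity rather than from a raw mel spectrogram, so the spectral feature information would first be converted into these three arrays.

```python
import numpy as np
import pyworld

def world_synthesize(f0: np.ndarray, spectral_envelope: np.ndarray,
                     aperiodicity: np.ndarray, fs: int = 16000) -> np.ndarray:
    # pyworld expects float64 arrays: f0 has shape (frames,), the other
    # two have shape (frames, fft_size // 2 + 1).
    return pyworld.synthesize(f0.astype(np.float64),
                              spectral_envelope.astype(np.float64),
                              aperiodicity.astype(np.float64), fs)
```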
Fig. 4 is a functional block diagram of the speech synthesis apparatus according to the present invention.
The speech synthesis apparatus 100 of the present invention can be installed in an electronic device. According to the implemented functions, the speech synthesis apparatus may include an audio processing module 101, a text processing module 102, and a speech synthesis module 103. A module, which may also be referred to as a unit, is a series of computer program segments that can be executed by a processor of an electronic device to perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the audio processing module 101 is configured to obtain a sample audio, perform sound feature extraction conversion and vectorization processing on the sample audio, and obtain a standard speech vector.
In the embodiment of the present invention, the sample audio is speech data of the target speaker whose voice is to be produced. For example, if subsequent text is to be synthesized into the voice of speaker A, the sample audio is speech data of speaker A.
Further, in order to make the speech synthesis of the subsequent text more accurate, the audio processing module 101 performs feature extraction processing on the sample audio to obtain the standard speech vector.
Because raw speech data is large and difficult to process directly, the audio processing module 101 performs sound feature extraction and conversion on the sample audio to obtain a target spectrogram.
In detail, in the embodiment of the present invention, the audio processing module 101 performs sound feature extraction and conversion on the sample audio to obtain a target spectrogram as follows:
resampling the sample audio to obtain a digital voice signal;
In the embodiment of the present invention, the sample audio is resampled to obtain the digital speech signal so that it can be processed as data; preferably, the embodiment of the present invention resamples the sample audio using an analog-to-digital converter.
Pre-emphasis is carried out on the digital voice signal to obtain a standard digital voice signal;
in detail, the embodiment of the present invention performs the pre-emphasis operation by using the following formula:
y(t)=x(t)-μx(t-1)
where x(t) is the digital speech signal, t is time, y(t) is the standard digital speech signal, and μ is a preset adjustment value for the pre-emphasis operation; preferably, μ lies in the range [0.9, 1.0].
Performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram;
In the embodiment of the invention, the standard digital speech signal only reflects how the audio changes in the time domain and cannot reflect the signal's audio characteristics. To expose those characteristics and make them more intuitive and clear, feature conversion is performed on the standard digital speech signal.
In detail, in the embodiment of the present invention, the audio processing module 101 performs feature conversion on the standard digital speech signal by mapping it into the frequency domain using a preset sound processing algorithm to obtain the target spectrogram. Preferably, the sound processing algorithm in the embodiment of the present invention is a mel filtering algorithm.
Further, to simplify the data and improve processing efficiency, the audio processing module 101 of the embodiment of the present invention vectorizes the target spectrogram: feature extraction is performed on the target spectrogram using a pre-constructed image classification model to obtain the standard speech vector. Preferably, in an embodiment of the present invention, the pre-constructed image classification model is a residual network model trained on a historical spectrogram set, where the historical spectrogram set is a collection of spectrograms of the same type as the target spectrogram but with different content.
In detail, in the embodiment of the present invention, the audio processing module 101 performs feature extraction on the target spectrogram to obtain the standard speech vector as follows:
obtaining the outputs of all nodes of the fully connected layer contained in the image classification model to obtain a target spectrogram feature value set;
For example, if the fully connected layer of the image classification model contains 1000 nodes, inputting a target spectrogram T into the model yields 1000 node output values, which form the target spectrogram feature value set of T; each node's output is one feature value of T, so the set contains 1000 feature values in total.
longitudinally combining the feature values in the target spectrogram feature value set according to the order of the nodes of the fully connected layer to obtain a standard speech vector;
For example, suppose the fully connected layer has 3 nodes, in order a first node, a second node, and a third node, and the target spectrogram feature value set of a target spectrogram A contains the 3 feature values 3, 5, and 1, where feature value 1 is the output of the first node, feature value 3 is the output of the second node, and feature value 5 is the output of the third node. Longitudinally combining the three feature values in node order yields the standard speech vector of target spectrogram A, the column vector (1, 3, 5)ᵀ.
The text processing module 102 is configured to, when receiving a text to be synthesized, perform phoneme conversion on the text to be synthesized to obtain a text phoneme sequence; performing vector conversion on the text phoneme sequence to obtain a text matrix; and carrying out vector splicing on the standard voice vector and the text matrix to obtain a target matrix.
In the embodiment of the present invention, the text to be synthesized is the text requiring speech synthesis; the phonemes of the pronunciations of texts in different languages can be represented with a common phonetic-notation rule.
In detail, in this embodiment of the present invention, the text processing module 102 performs phoneme conversion on the text to be synthesized to obtain the text phoneme sequence by: deleting punctuation from the text to be synthesized to obtain a standard text; and labeling the phoneme corresponding to each character in the standard text using a preset phonetic-notation rule to obtain the text phoneme sequence. For example, with the International Phonetic Alphabet as the preset rule, the character "o" is labeled with the phoneme a, and the resulting text phoneme sequence is [a].
In this embodiment of the present invention, the text processing module 102 converts each phoneme in the text phoneme sequence into a column vector using a one-hot encoding algorithm, and the columns together form the text matrix.
In detail, in this embodiment of the present invention, to improve subsequent speech synthesis, speech alignment must also be determined for each phoneme in the text phoneme sequence, i.e., each phoneme's pronunciation duration, called its phoneme frame length. The text processing module 102 therefore calculates the phoneme frame length of each phoneme in the text phoneme sequence using a preset algorithm model to obtain a phoneme frame length sequence; the preset algorithm model may be a DNN-HMM network model.
Further, in this embodiment of the present invention, the text processing module 102 converts the phoneme frame length sequence into a phoneme frame length vector, that is, into a corresponding row vector, and transversely splices the phoneme frame length vector and the text matrix to obtain the standard text matrix. For example, if the phoneme frame length vector is a 1 × 4 row vector and the text matrix is a 5 × 4 matrix, appending the phoneme frame length vector as a sixth row of the text matrix yields a 6 × 4 standard text matrix.
In detail, in the embodiment of the present invention, the text processing module 102 longitudinally splices the standard speech vector with each column of the standard text matrix to obtain the target matrix. Continuing the running example, splicing the 3 × 1 standard speech vector onto each column of the 6 × 4 standard text matrix yields a 9 × 4 target matrix.
The voice synthesis module 103 is configured to perform spectrum feature extraction on the target matrix to obtain spectrum feature information; and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
In order to further perform speech synthesis, the embodiment of the present invention further needs to determine a spectral feature of the target matrix, where the spectral feature may be Mel-frequency spectrum.
In detail, in the embodiment of the present invention, the trained acoustic model is used to perform spectrum feature extraction on the target matrix, so as to obtain the spectrum feature extraction. Preferably, the acoustic model may be a transform model.
Further, before the speech synthesis module 103 extracts the spectral features of the target matrix with the trained acoustic model, the method further includes: acquiring a historical text matrix set; labeling each historical text matrix in the historical text matrix set with spectral feature information to obtain a training set; and training the acoustic model with the training set until the model converges, yielding the trained acoustic model. The historical text matrix set is a collection of historical text matrices, each of which is the target matrix corresponding to a text different from the text to be synthesized.
In another embodiment of the present invention, in order to ensure the privacy of data, the spectrum feature information may be stored in a block link point.
In detail, in the embodiment of the present invention, the speech synthesis module 103 inputs the spectrum feature information to a preset vocoder to obtain the synthesized audio.
Preferably, the vocoder is a WORLD vocoder.
Fig. 5 is a schematic structural diagram of an electronic device implementing the speech synthesis method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a speech synthesis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of a speech synthesis program, but also to temporarily store data that has been output or is to be output.
The processor 10 may in some embodiments be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device: it connects the various components of the electronic device using various interfaces and lines, and executes the various functions of the electronic device 1 and processes its data by running or executing the programs or modules (e.g., the speech synthesis program) stored in the memory 11 and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may include a display and an input unit such as a keyboard, and optionally a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech synthesis program 12 stored in the memory 11 of the electronic device 1 is a combination of computer programs that, when executed by the processor 10, can implement:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
Specifically, the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of an electronic device, the computer program may implement:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
when a text to be synthesized is received, carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence;
performing vector conversion on the text phoneme sequence to obtain a text matrix;
performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information;
and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
2. The speech synthesis method of claim 1, wherein the performing the acoustic feature extraction conversion and the vectorization process on the sample audio to obtain a standard speech vector comprises:
carrying out sound feature extraction and conversion on the sample audio to obtain a target spectrogram;
and performing feature extraction on the target spectrogram by using a pre-constructed image classification model to obtain the standard voice vector.
3. The speech synthesis method of claim 2, wherein the performing acoustic feature extraction conversion on the sample audio to obtain a target spectrogram comprises:
resampling the sample audio to obtain a digital voice signal;
pre-emphasis is carried out on the digital voice signal to obtain a standard digital voice signal;
and performing characteristic conversion on the standard digital voice signal to obtain the target spectrogram.
4. The speech synthesis method of claim 2, wherein the extracting features of the target spectrogram by using the pre-constructed image classification model to obtain the standard speech vector comprises:
acquiring the outputs of all nodes of a fully connected layer contained in the image classification model to obtain a target spectrogram feature value set;
and longitudinally combining the feature values in the target spectrogram feature value set according to the order of the nodes of the fully connected layer to obtain a standard speech vector.
5. The speech synthesis method of claim 3, wherein said performing feature conversion on said standard digital speech signal to obtain said target spectrogram comprises:
and mapping the standard digital voice signal in a frequency domain by using a preset voice processing algorithm to obtain the target spectrogram.
6. The speech synthesis method of claim 1, wherein the vector-splicing the standard speech vector with the text matrix to obtain a target matrix comprises:
calculating the phoneme frame length of each phoneme in the text phoneme sequence by using a preset algorithm model to obtain a phoneme frame length sequence;
converting the phoneme frame length sequence into a phoneme frame length vector;
transversely splicing the phoneme frame length vector and the text matrix to obtain a standard text matrix;
and longitudinally splicing the standard voice vector and each column of the standard text matrix to obtain the target matrix.
7. The speech synthesis method according to any one of claims 1 to 6, wherein the performing phoneme conversion on the text to be synthesized to obtain a text phoneme sequence comprises:
performing punctuation deletion on the text to be synthesized to obtain a standard text;
and marking the phoneme corresponding to each character in the standard text by using a preset phonetic symbol rule to obtain the text phoneme sequence.
8. A speech synthesis apparatus, comprising:
the audio processing module is used for acquiring sample audio, and performing sound feature extraction conversion and vectorization processing on the sample audio to obtain a standard voice vector;
the text processing module is used for carrying out phoneme conversion on the text to be synthesized to obtain a text phoneme sequence when the text to be synthesized is received; performing vector conversion on the text phoneme sequence to obtain a text matrix; performing vector splicing on the standard voice vector and the text matrix to obtain a target matrix;
the voice synthesis module is used for extracting the frequency spectrum characteristic of the target matrix to obtain frequency spectrum characteristic information; and performing voice synthesis on the frequency spectrum characteristic information by using a preset vocoder to obtain a synthesized audio.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the speech synthesis method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a speech synthesis method according to any one of claims 1 to 7.
CN202011442571.2A 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium Pending CN112397047A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011442571.2A CN112397047A (en) 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium
PCT/CN2021/083824 WO2022121176A1 (en) 2020-12-11 2021-03-30 Speech synthesis method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011442571.2A CN112397047A (en) 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112397047A true CN112397047A (en) 2021-02-23

Family

ID=74625646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011442571.2A Pending CN112397047A (en) 2020-12-11 2020-12-11 Speech synthesis method, device, electronic equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN112397047A (en)
WO (1) WO2022121176A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113096625A (en) * 2021-03-24 2021-07-09 平安科技(深圳)有限公司 Multi-person Buddha music generation method, device, equipment and storage medium
CN113327578A (en) * 2021-06-10 2021-08-31 平安科技(深圳)有限公司 Acoustic model training method and device, terminal device and storage medium
CN113436608A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Double-stream voice conversion method, device, equipment and storage medium
WO2022121176A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN114783406A (en) * 2022-06-16 2022-07-22 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769481B2 (en) 2021-10-07 2023-09-26 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4966048B2 (en) * 2007-02-20 2012-07-04 株式会社東芝 Voice quality conversion device and speech synthesis device
US10186252B1 (en) * 2015-08-13 2019-01-22 Oben, Inc. Text to speech synthesis using deep neural network with constant unit length spectrogram
CN111161702B (en) * 2019-12-23 2022-08-26 爱驰汽车有限公司 Personalized speech synthesis method and device, electronic equipment and storage medium
CN112002305B (en) * 2020-07-29 2024-06-18 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN112397047A (en) * 2020-12-11 2021-02-23 平安科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and readable storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121176A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN113096625A (en) * 2021-03-24 2021-07-09 平安科技(深圳)有限公司 Multi-person Buddha music generation method, device, equipment and storage medium
CN112927677A (en) * 2021-03-29 2021-06-08 北京大米科技有限公司 Speech synthesis method and device
CN113327578A (en) * 2021-06-10 2021-08-31 平安科技(深圳)有限公司 Acoustic model training method and device, terminal device and storage medium
CN113327578B (en) * 2021-06-10 2024-02-02 平安科技(深圳)有限公司 Acoustic model training method and device, terminal equipment and storage medium
CN113436608A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Double-stream voice conversion method, device, equipment and storage medium
CN113436608B (en) * 2021-06-25 2023-11-28 平安科技(深圳)有限公司 Double-flow voice conversion method, device, equipment and storage medium
CN114783406A (en) * 2022-06-16 2022-07-22 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium
CN116705058B (en) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
WO2022121176A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN109686361B (en) Speech synthesis method, device, computing equipment and computer storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111862937A (en) Singing voice synthesis method, singing voice synthesis device and computer readable storage medium
CN113345431B (en) Cross-language voice conversion method, device, equipment and medium
CN112820269B (en) Text-to-speech method and device, electronic equipment and storage medium
CN112951203B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN113205814B (en) Voice data labeling method and device, electronic equipment and storage medium
CN112509554A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN112233700A (en) Audio-based user state identification method and device and storage medium
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN113887200A (en) Text variable-length error correction method and device, electronic equipment and storage medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN113707124A (en) Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN112201253A (en) Character marking method and device, electronic equipment and computer readable storage medium
CN112489628A (en) Voice data selection method and device, electronic equipment and storage medium
CN116564322A (en) Voice conversion method, device, equipment and storage medium
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
CN115631748A (en) Emotion recognition method and device based on voice conversation, electronic equipment and medium
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language
CN113990286A (en) Speech synthesis method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination