CN114783402B - Variation method and device for synthetic voice, electronic equipment and storage medium - Google Patents

Variation method and device for synthetic voice, electronic equipment and storage medium

Info

Publication number
CN114783402B
Authority
CN
China
Prior art keywords
paragraph
actual
variation
natural
adjusting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210707967.8A
Other languages
Chinese (zh)
Other versions
CN114783402A (en)
Inventor
余勇
钟少恒
王翊
王佳骏
陈志刚
陈捷
曹小冬
吴启明
蔡勇超
林承勋
吕华良
丁铖
林家树
郭泽豪
符春造
方美明
陈瑾
李鸿盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Original Assignee
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Power Supply Bureau of Guangdong Power Grid Corp filed Critical Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority to CN202210707967.8A priority Critical patent/CN114783402B/en
Publication of CN114783402A publication Critical patent/CN114783402A/en
Application granted granted Critical
Publication of CN114783402B publication Critical patent/CN114783402B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a variation method and apparatus for synthesized speech, an electronic device and a storage medium, which address the technical problems that existing synthesized speech lacks a clear sense of hierarchy and vividness. The method comprises the following steps: acquiring a preprocessed text and identifying natural paragraphs from the preprocessed text; adjusting the natural paragraphs to obtain actual paragraphs; sequentially calculating the correlation between two adjacent actual paragraphs; generating synthesized speech of the actual paragraphs; acquiring the language rhythm of each actual paragraph in the synthesized speech; and adjusting the language rhythm according to the correlation to obtain variation synthesized speech.

Description

Variation method and device for synthetic voice, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech variation technologies, and in particular, to a synthetic speech variation method and apparatus, an electronic device, and a storage medium.
Background
Speech synthesis, also known as text-to-speech (TTS) technology, can convert arbitrary text information into standard, fluent speech in real time and read it aloud, which is comparable to fitting a machine with an artificial mouth. It involves multiple disciplines such as acoustics, linguistics, digital signal processing and computer science, and is a cutting-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text information into audible sound information, that is, how to make a machine speak like a person. This is fundamentally different from conventional sound playback devices (systems). Conventional sound playback devices (systems), such as tape recorders, "let the machine speak" by prerecording sound and then playing it back, but this approach has significant limitations in terms of content, storage, transmission, convenience and timeliness. With computer speech synthesis, any text can be converted into highly natural speech at any time, so that the machine truly speaks like a human.
In speech synthesis, selecting a basic rhythm is one of the most important steps, and the basic rhythm is generally determined by the keynote of the text. However, a basic rhythm decided by the keynote alone is relatively flat, so the resulting synthesized speech lacks a clear sense of hierarchy and sounds stiff rather than vivid.
Disclosure of Invention
The invention provides a variation method and apparatus for synthesized speech, an electronic device and a storage medium, which address the technical problems that existing synthesized speech lacks a clear sense of hierarchy and vividness.
The invention provides a variation method for synthesized speech, which comprises the following steps: acquiring a preprocessed text, and identifying natural paragraphs from the preprocessed text; adjusting the natural paragraphs to obtain actual paragraphs; sequentially calculating the correlation between two adjacent actual paragraphs; generating synthesized speech of the actual paragraphs; acquiring the language rhythm of each actual paragraph in the synthesized speech; and adjusting the language rhythm according to the correlation to obtain variation synthesized speech.
Optionally, the step of acquiring a preprocessed text and identifying natural paragraphs from the preprocessed text includes: acquiring a preprocessed text, and identifying a line feed key in the preprocessed text; and splitting the preprocessed text into a number of natural paragraphs based on the line feed key.
Optionally, the step of adjusting the natural paragraphs to obtain actual paragraphs includes: judging whether each natural paragraph has only one scene; if not, splitting the natural paragraph according to the scene to generate actual paragraphs; if yes, judging whether two adjacent natural paragraphs are in the same scene; and if so, merging two adjacent natural paragraphs of the same scene into the same actual paragraph.
Optionally, the step of adjusting the language rhythm according to the correlation to obtain variation synthesized speech includes: traversing all the actual paragraphs, and sequentially determining each actual paragraph as the current adjustment paragraph; when the correlation between the current adjustment paragraph and the last actual paragraph is greater than a first preset threshold, adjusting the language rhythm of the current adjustment paragraph to obtain a variation paragraph, the variation paragraph having a plurality of sentences; determining the sentence correlation of two adjacent sentences in the variation paragraph; when the sentence correlation between the current sentence and the previous sentence is greater than a second preset threshold, adjusting the language rhythm of the current sentence to obtain a sentence adjustment rhythm; and generating the variation synthesized speech by using the sentence adjustment rhythm of each sentence in all the actual paragraphs.
Optionally, the step of adjusting the language rhythm of the current adjustment paragraph to obtain a variation paragraph when the paragraph correlation between the current adjustment paragraph and the last actual paragraph is greater than the first preset threshold includes: when the correlation between the current adjustment paragraph and the last actual paragraph is greater than the first preset threshold, acquiring a first paragraph adjustment index of the current actual paragraph and a second paragraph adjustment index of the last actual paragraph; comparing the first paragraph adjustment index with the second paragraph adjustment index to determine a target paragraph adjustment index; and adjusting the language rhythm of the current adjustment paragraph based on the target paragraph adjustment index to obtain the variation paragraph.
The present invention also provides a variation apparatus for synthesized speech, comprising: a natural paragraph identification module, used for acquiring a preprocessed text and identifying natural paragraphs from the preprocessed text; an actual paragraph obtaining module, used for adjusting the natural paragraphs to obtain actual paragraphs; a correlation calculation module, used for sequentially calculating the correlation between two adjacent actual paragraphs; a synthesized speech generation module, used for generating synthesized speech of the actual paragraphs; a language rhythm obtaining module, used for acquiring the language rhythm of each actual paragraph in the synthesized speech; and a variation module, used for adjusting the language rhythm according to the correlation to obtain variation synthesized speech.
Optionally, the natural paragraph identification module includes: a line-change key identification submodule, used for acquiring a preprocessed text and identifying line-change keys in the preprocessed text; and a preprocessed text splitting submodule, used for splitting the preprocessed text into a number of natural paragraphs based on the line feed key.
Optionally, the actual paragraph obtaining module includes: a first scene judgment submodule, used for judging whether each natural paragraph has only one scene; a natural paragraph splitting submodule, used for splitting the natural paragraph according to the scene to generate actual paragraphs if a natural paragraph has more than one scene; a second scene judgment submodule, used for judging whether two adjacent natural paragraphs are in the same scene if each natural paragraph has only one scene; and a paragraph merging submodule, used for merging two adjacent natural paragraphs of the same scene into the same actual paragraph if they are in the same scene.
The invention also provides an electronic device, comprising a processor and a memory: the memory is used for storing program code and transmitting the program code to the processor; and the processor is used for executing the variation method of synthesized speech described above according to the instructions in the program code.
The present invention also provides a computer-readable storage medium, which is used for storing program code for executing the variation method of synthesized speech described in any one of the above.
According to the above technical solutions, the invention has the following advantages. The method acquires a preprocessed text and identifies natural paragraphs from it; adjusts the natural paragraphs to obtain actual paragraphs; sequentially calculates the correlation between two adjacent actual paragraphs; generates synthesized speech of the actual paragraphs; acquires the language rhythm of each actual paragraph in the synthesized speech; and adjusts the language rhythm according to the correlation to obtain variation synthesized speech. By adjusting the language rhythm within the synthesized speech, the synthesized speech gains a clearer sense of hierarchy and sounds more vivid.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a variation method for synthesizing speech according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a variation method for synthesized speech according to another embodiment of the present invention;
fig. 3 is a block diagram of a structure of a variation apparatus for synthesizing speech according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a variation method and apparatus for synthesized speech, an electronic device and a storage medium, which address the technical problems that existing synthesized speech lacks a clear sense of hierarchy and vividness.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for synthesizing speech variation according to an embodiment of the present invention.
The variation method for synthesizing the voice provided by the invention specifically comprises the following steps:
step 101, acquiring a preprocessed text, and identifying a natural paragraph from the preprocessed text;
in the embodiment of the invention, after the preprocessed text which needs to be subjected to speech synthesis is obtained, paragraph recognition can be performed on the preprocessed text, and the preprocessed text is divided into a plurality of natural paragraphs.
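As a non-limiting illustration (not part of the patent text), the natural-paragraph recognition can be pictured as splitting the preprocessed text at line breaks, which matches the line-feed-based splitting described in the later embodiment; the Python function below is a minimal sketch and its name is assumed.

def split_into_natural_paragraphs(preprocessed_text: str) -> list[str]:
    """Split preprocessed text into natural paragraphs at line breaks.

    A minimal sketch: each non-empty line delimited by a line feed is
    treated as one natural paragraph.
    """
    paragraphs = []
    for chunk in preprocessed_text.split("\n"):
        chunk = chunk.strip()
        if chunk:                      # skip blank lines left over from typesetting
            paragraphs.append(chunk)
    return paragraphs


# Example usage
text = "First paragraph about scene A.\nSecond paragraph about scene B.\n"
print(split_into_natural_paragraphs(text))
# ['First paragraph about scene A.', 'Second paragraph about scene B.']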
Step 102, adjusting the natural paragraph to obtain an actual paragraph;
In practical applications, an author sometimes breaks paragraphs incorrectly, or paragraph boundaries change during typesetting, so that a single paragraph is split apart or different paragraphs are merged. This can leave a paragraph whose scene or emotional level is discontinuous, or a paragraph that mixes markedly different scenes or emotional levels. Therefore, after the natural paragraphs in the preprocessed text are obtained, the natural paragraphs can be adjusted to obtain actual paragraphs, each with a single scene and a complete emotion.
Step 103, calculating the correlation between two adjacent actual paragraphs in sequence;
in the embodiment of the present invention, after the actual paragraphs are divided, the correlations between two adjacent actual paragraphs may be sequentially calculated, so as to facilitate the subsequent judgment of whether to perform variation processing on the actual paragraphs.
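The embodiment leaves the concrete correlation measure open (a later passage mentions a neural network model), so the following is only a hedged stand-in: it approximates the correlation of two adjacent actual paragraphs with the cosine similarity of word-count vectors. The function name and this choice of measure are assumptions.

import math
from collections import Counter

def paragraph_correlation(prev_paragraph: str, curr_paragraph: str) -> float:
    """Rough correlation of two adjacent actual paragraphs.

    A minimal sketch using cosine similarity of word-count vectors; the
    patent itself leaves the correlation model open (e.g. a neural network).
    """
    a, b = Counter(prev_paragraph.split()), Counter(curr_paragraph.split())
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Correlations of each adjacent pair, in order
paragraphs = ["the storm rolled over the hills", "the storm kept rolling all night"]
correlations = [paragraph_correlation(p, q) for p, q in zip(paragraphs, paragraphs[1:])]
print(correlations)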
Step 104, generating a synthetic voice of an actual paragraph;
After the actual paragraph segmentation is completed, the actual paragraphs can be used to generate synthesized speech.
It should be noted that the embodiment of the present invention does not limit the method for synthesizing speech, and a person skilled in the art may select any speech synthesis method according to actual situations.
Step 105, acquiring the language rhythm of each actual paragraph in the synthesized voice;
and step 106, adjusting the language rhythm according to the correlation to obtain variation synthesized voice.
In the embodiment of the invention, after the synthesized speech is generated, the language rhythm of each actual paragraph in the synthesized speech can be acquired, and the language rhythm of the synthesized speech is then adjusted according to the correlation between adjacent actual paragraphs to obtain the variation synthesized speech.
In one example, the language cadence of the actual passage may include speech rate, pitch, volume, pause, etc.
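For concreteness, the language rhythm of one actual paragraph can be represented as a small record of these quantities; the field names and units in this sketch are assumptions, not values fixed by the patent.

from dataclasses import dataclass

@dataclass
class LanguageRhythm:
    """Per-paragraph prosody parameters mentioned in the text (units assumed)."""
    speech_rate: float   # e.g. syllables per second
    pitch: float         # e.g. mean fundamental frequency in Hz
    volume: float        # e.g. mean level in dB
    pause: float         # e.g. mean inter-sentence pause in seconds

# Example: rhythm extracted (by some analyzer) for one actual paragraph
rhythm = LanguageRhythm(speech_rate=4.5, pitch=180.0, volume=-20.0, pause=0.35)
print(rhythm)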
In summary, a preprocessed text is acquired and natural paragraphs are identified from it; the natural paragraphs are adjusted to obtain actual paragraphs; the correlation between two adjacent actual paragraphs is calculated in sequence; synthesized speech of the actual paragraphs is generated; the language rhythm of each actual paragraph in the synthesized speech is acquired; and the language rhythm is adjusted according to the correlation to obtain variation synthesized speech. By adjusting the language rhythm within the synthesized speech, the synthesized speech gains a clearer sense of hierarchy and sounds more vivid.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for synthesizing speech variation according to another embodiment of the present invention. The method specifically comprises the following steps:
step 201, acquiring a preprocessed text, and identifying a line feed key in the preprocessed text;
step 202, splitting the preprocessed text into a plurality of natural paragraphs based on a line feed key;
In the embodiment of the invention, the natural paragraphs in the preprocessed text can be distinguished by retrieving the line feed key in the preprocessed text.
Step 203, adjusting the natural paragraph to obtain an actual paragraph;
In practical applications, an author sometimes breaks paragraphs incorrectly, or paragraph boundaries change during typesetting, so that a single paragraph is split apart or different paragraphs are merged. This can leave a paragraph whose scene or emotional level is discontinuous, or a paragraph that mixes markedly different scenes or emotional levels. Therefore, after the natural paragraphs in the preprocessed text are obtained, they can be adjusted to obtain actual paragraphs, each with a single scene and a complete emotion.
In an example, the step of adjusting the natural paragraph to obtain an actual paragraph may specifically include the following sub-steps:
S31, judging whether each natural paragraph has only one scene;
S32, if not, splitting the natural paragraph according to the scene to generate an actual paragraph;
In a specific implementation, the scene determination may be performed by a pre-trained first neural network model. The first neural network model can be obtained by the following training process: a large number of paragraphs are collected as sample data, some containing only one scene and others containing more than one scene, and each paragraph is labeled with its number of scenes. The first neural network model is then trained with the paragraphs as input and the number of scenes as output.
The natural paragraphs obtained from the preprocessed text are input into the trained first neural network model, and the number of scenes in each natural paragraph is judged from the output result.
When a natural paragraph contains more than one scene, it is split by scene, and each resulting part is used as an actual paragraph.
S33, if yes, judging whether two adjacent natural paragraphs are in the same scene;
S34, if so, merging two adjacent natural paragraphs of the same scene into the same actual paragraph.
When each natural paragraph contains only one scene, whether two adjacent natural paragraphs belong to the same scene can be judged by a trained second neural network model. The second neural network model can be obtained by the following training process: a large number of paragraphs are collected as sample data, and every pair of paragraphs is labeled as belonging to the same scene or not. The second neural network model is then trained with paragraph pairs as input and the same-scene label as output.
Two adjacent natural paragraphs that each contain only one scene are input into the trained second neural network model, and whether they are in the same scene is judged from the output result. If so, the two natural paragraphs are merged into one actual paragraph.
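Putting the two judgments together, a hedged sketch of the natural-to-actual paragraph adjustment might look as follows; count_scenes, split_by_scene and same_scene stand in for the trained first and second neural network models and are assumed names, not components specified by the patent.

from typing import Callable

def adjust_to_actual_paragraphs(
    natural_paragraphs: list[str],
    count_scenes: Callable[[str], int],        # stands in for the first model
    split_by_scene: Callable[[str], list[str]],
    same_scene: Callable[[str, str], bool],    # stands in for the second model
) -> list[str]:
    """Split multi-scene paragraphs, then merge adjacent same-scene paragraphs."""
    # Step 1: split any natural paragraph that contains more than one scene.
    single_scene: list[str] = []
    for para in natural_paragraphs:
        if count_scenes(para) > 1:
            single_scene.extend(split_by_scene(para))
        else:
            single_scene.append(para)

    # Step 2: merge adjacent paragraphs that the pair model judges to share a scene.
    actual: list[str] = []
    for para in single_scene:
        if actual and same_scene(actual[-1], para):
            actual[-1] = actual[-1] + " " + para
        else:
            actual.append(para)
    return actual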
It should be noted that, besides the scene, those skilled in the art may also adopt other elements, such as the emotional level, as the basis for dividing actual paragraphs, and the embodiment of the present invention is not limited in this respect.
Step 204, calculating the correlation between two adjacent actual paragraphs in sequence;
in the embodiment of the present invention, after the actual paragraphs are divided, the correlations between two adjacent actual paragraphs may be sequentially calculated, so as to facilitate the subsequent judgment of whether to perform variation processing on the actual paragraphs.
In a specific implementation, the correlation of two adjacent actual paragraphs can be analyzed by a neural network model. Step 205, generating a synthesized voice of the actual paragraph; after the actual paragraph segmentation is completed, the actual paragraph may be used to generate synthesized speech.
It should be noted that the embodiment of the present invention does not limit the method for synthesizing speech, and a person skilled in the art may select any speech synthesis method according to actual situations.
Step 206, acquiring the language rhythm of each actual paragraph in the synthesized voice;
Step 207, adjusting the language rhythm according to the correlation to obtain variation synthesized voice.
In the embodiment of the invention, after the synthesized speech is generated, the language rhythm of each actual paragraph in the synthesized speech can be acquired, and the language rhythm of the synthesized speech is then adjusted according to the correlation between adjacent actual paragraphs to obtain the variation synthesized speech.
In one example, the language cadence of the actual passage may include speech rate, pitch, volume, pause, etc.
In an example, the step of adjusting the language rhythm according to the correlation to obtain the variation synthesized speech may specifically include the following sub-steps:
S71, traversing all the actual paragraphs, and determining each actual paragraph as the current adjustment paragraph in turn;
S72, when the correlation between the current adjustment paragraph and the last actual paragraph is greater than the first preset threshold, adjusting the language rhythm of the current adjustment paragraph to obtain a variation paragraph; the variation paragraph has a plurality of sentences;
In the embodiment of the present invention, when the correlation between the current adjustment paragraph and the last actual paragraph is greater than a first preset threshold (e.g., 80%), the scenes or emotional levels of the two paragraphs can be considered similar, and their basic rhythms are therefore also similar. If no adjustment is made, the rhythm of the synthesized speech remains very flat, and continuously flat speech sounds unemotional and not vivid enough. Therefore, in this case, the current adjustment paragraph can be varied to obtain a variation paragraph, making the speech more vivid.
In an example, when the paragraph correlation between the current adjusted paragraph and the last actual paragraph is greater than a first preset threshold, the step of adjusting the language rhythm of the current adjusted paragraph to obtain a variation paragraph may specifically include:
s721, when the correlation between the current paragraph and the previous paragraph is greater than the first preset threshold, obtaining the first paragraph adjustment index of the current paragraph and the second paragraph adjustment index of the previous paragraph;
s722, comparing the first paragraph adjustment index with the second paragraph adjustment index, and determining a target paragraph adjustment index;
and S723, adjusting the language rhythm of the current adjusted paragraph based on the target paragraph adjustment index to obtain a variation paragraph.
In a specific implementation, since the language rhythm may include speech rate, pitch, volume and the like, these quantities can be used as the adjustment indexes. The speech rate, pitch and volume of the current adjustment paragraph are the first adjustment indexes, and the speech rate, pitch and volume of the last actual paragraph are the second adjustment indexes.
The first adjustment indexes are compared with the second adjustment indexes to determine which of the speech rate, pitch and volume is most similar between the two paragraphs, and the index with the greatest similarity is taken as the target adjustment index.
For example, when the speech rate of the current adjustment paragraph is most similar to the speech rate of the last actual paragraph, the speech rate of the current adjustment paragraph is adjusted, for example by speeding it up or slowing it down.
It should be noted that, in terms of listening experience, a speech rate that is too fast or too slow degrades the auditory effect. Therefore, in the embodiment of the present invention, upper and lower thresholds may be set for the speech rate. During adjustment, when the speech rate of the current adjustment paragraph approaches the lower threshold it may be increased, and when it approaches the upper threshold it may be decreased.
Similarly, upper and lower thresholds may also be set for pitch and volume, so that adjusting them does not degrade the auditory effect.
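A hedged sketch of this paragraph-level variation step follows: it picks whichever of speech rate, pitch or volume is most similar between the current adjustment paragraph and the last actual paragraph, then nudges that value of the current paragraph away from the nearer bound. The bounds, the 10% step and the helper names are illustrative assumptions, not values specified by the patent.

# Illustrative bounds for each adjustment index; the patent only says upper and
# lower thresholds may be set, so these numbers and units are assumptions.
BOUNDS = {
    "speech_rate": (2.0, 7.0),    # syllables per second (assumed)
    "pitch": (80.0, 300.0),       # Hz (assumed)
    "volume": (-35.0, -5.0),      # dB relative to full scale (assumed)
}

def similarity(a: float, b: float) -> float:
    """Simple similarity in [0, 1]: 1.0 when equal, smaller as values diverge."""
    denom = max(abs(a), abs(b), 1e-9)
    return 1.0 - abs(a - b) / denom

def vary_paragraph(current: dict, previous: dict, step_fraction: float = 0.10) -> dict:
    """Vary the current paragraph's rhythm on its most similar index, within bounds."""
    # Target index = the one most similar between the two paragraphs.
    target = max(BOUNDS, key=lambda k: similarity(current[k], previous[k]))

    low, high = BOUNDS[target]
    value = current[target]
    step = step_fraction * (high - low)
    # Push the value away from whichever bound it is closer to.
    if value <= (low + high) / 2.0:
        value = min(high, value + step)   # near the lower bound: increase
    else:
        value = max(low, value - step)    # near the upper bound: decrease

    varied = dict(current)
    varied[target] = value
    return varied

# Example: speech rate is the most similar index here, so only it is nudged.
previous = {"speech_rate": 4.4, "pitch": 150.0, "volume": -20.0}
current = {"speech_rate": 4.5, "pitch": 210.0, "volume": -28.0}
print(vary_paragraph(current, previous))   # speech_rate raised from 4.5 to 5.0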
It should be noted that, when adjusting the language rhythm of the current adjustment paragraph, every sentence in the whole paragraph may be adjusted by the same ratio, for example reducing the speech rate of the whole paragraph by 20%, to obtain the adjusted variation paragraph.
S73, determining the sentence correlation of two adjacent sentences in the variation paragraph;
S74, when the sentence correlation between the current sentence and the previous sentence is greater than a second preset threshold, adjusting the language rhythm of the current sentence to obtain a sentence adjustment rhythm;
S75, generating the variation synthesized speech by using the sentence adjustment rhythm of each sentence in all the actual paragraphs.
The adjustment above only changes the language rhythm of a whole paragraph uniformly, while within the adjusted variation paragraph the language rhythm should still differ from sentence to sentence. Therefore, in the embodiment of the present invention, after the adjustment of the current adjustment paragraph is completed, each sentence in the adjusted variation paragraph may be further adjusted.
First, the sentence correlation between two adjacent sentences is obtained, and whether the sentence correlation between the current sentence and the previous sentence is greater than a second preset threshold is judged. If so, the language rhythms of the two sentences are similar, and the language rhythm of the current sentence is adjusted. For this adjustment, one of the speech rate, volume and pitch may be adjusted; the process is the same as the paragraph-level adjustment of speech rate, pitch and volume described above and is not repeated here.
After the language rhythm of each sentence has been adjusted, the sentence adjustment rhythms of all sentences can be used to generate the variation synthesized speech.
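To tie the paragraph-level and sentence-level steps together, the following is an end-to-end sketch under assumed interfaces; paragraph_correlation, sentence_correlation, adjust_paragraph_rhythm and adjust_sentence_rhythm are placeholders for the components described above, and the 0.8 defaults follow the 80% example given for the first threshold (the second threshold value is assumed).

def vary_synthesized_speech(
    paragraphs,                 # list of lists: sentences (with rhythm info) per actual paragraph
    paragraph_correlation,      # callable(prev_par, curr_par) -> float
    sentence_correlation,       # callable(prev_sent, curr_sent) -> float
    adjust_paragraph_rhythm,    # callable(paragraph) -> paragraph (uniform ratio, e.g. -20% rate)
    adjust_sentence_rhythm,     # callable(sentence) -> sentence (one of rate/pitch/volume)
    first_threshold: float = 0.8,
    second_threshold: float = 0.8,
):
    """Paragraph-level then sentence-level rhythm variation (illustrative sketch)."""
    result = []
    for i, paragraph in enumerate(paragraphs):
        # Paragraph level: vary when the correlation with the last actual paragraph is high.
        if i > 0 and paragraph_correlation(paragraphs[i - 1], paragraph) > first_threshold:
            paragraph = adjust_paragraph_rhythm(paragraph)

        # Sentence level: within the (possibly varied) paragraph, vary a sentence
        # whose correlation with the previous sentence is high.
        adjusted = []
        for j, sentence in enumerate(paragraph):
            if j > 0 and sentence_correlation(adjusted[j - 1], sentence) > second_threshold:
                sentence = adjust_sentence_rhythm(sentence)
            adjusted.append(sentence)
        result.append(adjusted)
    return result  # sentence adjustment rhythms used to regenerate the variation speech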
In summary, a preprocessed text is acquired and natural paragraphs are identified from it; the natural paragraphs are adjusted to obtain actual paragraphs; the correlation between two adjacent actual paragraphs is calculated in sequence; synthesized speech of the actual paragraphs is generated; the language rhythm of each actual paragraph in the synthesized speech is acquired; and the language rhythm is adjusted according to the correlation to obtain variation synthesized speech. By adjusting the language rhythm within the synthesized speech, the synthesized speech gains a clearer sense of hierarchy and sounds more vivid.
Referring to fig. 3, fig. 3 is a block diagram of a structure of a speech synthesis variation apparatus according to an embodiment of the present invention.
The embodiment of the invention provides a variation device for synthesizing voice, which comprises:
a natural paragraph identification module 301, configured to obtain a preprocessed text and identify a natural paragraph from the preprocessed text;
an actual paragraph obtaining module 302, configured to adjust the natural paragraph to obtain an actual paragraph;
a correlation calculation module 303, configured to calculate correlations between two adjacent actual paragraphs in sequence;
a synthesized speech generation module 304 for generating a synthesized speech of the actual paragraph;
a language rhythm obtaining module 305, configured to obtain a language rhythm of each actual paragraph in the synthesized speech;
and the variation module 306 is configured to adjust the language rhythm according to the correlation, so as to obtain a variation synthesized voice.
In this embodiment of the present invention, the natural paragraph identification module 301 includes:
the line-changing key identification submodule is used for acquiring the preprocessed text and identifying line-changing keys in the preprocessed text;
and the preprocessed text splitting sub-module is used for splitting the preprocessed text into a plurality of natural paragraphs based on the line feed key.
In this embodiment of the present invention, the actual paragraph obtaining module 302 includes:
the first scene judging submodule is used for judging whether each natural paragraph only has one scene;
the natural paragraph splitting submodule is used for splitting the natural paragraph according to the scene to generate an actual paragraph if a natural paragraph has more than one scene;
the second scene judging submodule is used for judging whether two adjacent natural paragraphs are in the same scene if each natural paragraph has only one scene;
and the paragraph merging submodule is used for merging two adjacent natural paragraphs of the same scene into the same actual paragraph if the two adjacent natural paragraphs are in the same scene.
In the embodiment of the present invention, the variation module 306 includes:
the traversal submodule is used for traversing all the actual paragraphs and determining each actual paragraph as a current adjustment paragraph in sequence;
the variation paragraph acquisition sub-module is used for adjusting the language rhythm of the current adjustment paragraph to obtain a variation paragraph when the correlation between the current adjustment paragraph and the last actual paragraph is greater than a first preset threshold; the variation paragraph has a plurality of sentences;
the sentence correlation determination submodule is used for determining the sentence correlation of two adjacent sentences in the variation paragraph;
the sentence adjusting rhythm obtaining submodule is used for adjusting the language rhythm of the current sentence to obtain a sentence adjusting rhythm when the sentence correlation between the current sentence and the previous sentence is larger than a second preset threshold;
and the variation synthesized speech generation submodule is used for generating variation synthesized speech by using the sentence adjustment rhythm of each sentence in all the actual paragraphs.
In an embodiment of the present invention, the variation paragraph obtaining sub-module includes:
a paragraph adjustment index obtaining unit, configured to obtain a first paragraph adjustment index of a current actual paragraph and a second paragraph adjustment index of a previous actual paragraph when a correlation between the current adjustment paragraph and the previous actual paragraph is greater than a first preset threshold;
a target paragraph adjustment index determining unit, configured to compare the first paragraph adjustment index and the second paragraph adjustment index, and determine a target paragraph adjustment index;
and the variation paragraph acquisition unit is used for adjusting the language rhythm of the current adjusted paragraph based on the target paragraph adjustment index to obtain the variation paragraph.
An embodiment of the present invention further provides an electronic device, where the device includes a processor and a memory:
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is used for executing the variation method of the synthesized voice according to the instruction in the program code.
Embodiments of the present invention also provide a computer-readable storage medium for storing a program code for executing the variation method of synthesized speech according to the embodiments of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A variation method of synthesizing speech, comprising:
acquiring a preprocessed text, and identifying natural paragraphs from the preprocessed text;
adjusting the natural paragraph to obtain an actual paragraph;
sequentially calculating the correlation between two adjacent actual paragraphs;
generating a synthesized speech of the actual paragraph;
acquiring the language rhythm of each actual paragraph in the synthesized voice;
adjusting the language rhythm according to the correlation to obtain variation synthesized voice;
wherein, the step of adjusting the language rhythm according to the correlation to obtain variation synthesized voice comprises:
traversing all the actual paragraphs, and sequentially determining each actual paragraph as a current adjustment paragraph;
when the correlation between the current adjusting paragraph and the last actual paragraph is larger than a first preset threshold value, adjusting the language rhythm of the current adjusting paragraph to obtain a variation paragraph; the variation paragraph has a plurality of sentences;
determining sentence relevance of two adjacent sentences in the variation paragraph;
when the sentence correlation between the current sentence and the previous sentence is larger than a second preset threshold value, adjusting the language rhythm of the current sentence to obtain a sentence adjustment rhythm;
generating variation synthesized speech by using the sentence adjustment rhythm of each sentence in all the actual paragraphs;
when the paragraph correlation between the current adjusted paragraph and the last actual paragraph is greater than a first preset threshold, the step of adjusting the language rhythm of the current adjusted paragraph to obtain a variation paragraph includes:
when the correlation between the current adjustment paragraph and the last actual paragraph is greater than a first preset threshold, obtaining a first paragraph adjustment index of the current actual paragraph and a second paragraph adjustment index of the last actual paragraph; the first paragraph adjustment index is the speech speed, the tone and the volume of the current adjustment paragraph; the second paragraph adjustment indexes are the speech speed, the tone and the volume of the last practical paragraph;
comparing the first paragraph adjustment index with the second paragraph adjustment index, and determining, among the speech speed, the tone and the volume, the index with the greatest similarity between the current adjustment paragraph and the last actual paragraph as the target paragraph adjustment index;
and adjusting the language rhythm of the current adjusted paragraph based on the target paragraph adjustment index to obtain a variation paragraph.
2. The method of claim 1, wherein the step of obtaining pre-processed text and identifying natural paragraphs from the pre-processed text comprises:
acquiring a preprocessed text, and identifying a line feed key in the preprocessed text;
splitting the preprocessed text into a number of natural paragraphs based on the line feed key.
3. The method of claim 1, wherein the step of adjusting the natural passage to obtain an actual passage comprises:
judging whether each natural paragraph has only one scene;
if not, splitting the natural paragraphs according to the scene to generate actual paragraphs;
if yes, judging whether two adjacent natural paragraphs are in the same scene;
if so, merging two adjacent natural paragraphs of the same scene into the same actual paragraph.
4. A variation apparatus for synthesizing speech, comprising:
the natural paragraph identification module is used for acquiring the preprocessed text and identifying the natural paragraph from the preprocessed text;
the actual paragraph acquisition module is used for adjusting the natural paragraph to obtain an actual paragraph;
the correlation calculation module is used for calculating the correlation between two adjacent actual paragraphs in sequence;
a synthesized speech generation module for generating a synthesized speech of the actual paragraph;
a language rhythm obtaining module, configured to obtain a language rhythm of each actual paragraph in the synthesized speech;
the variation module is used for adjusting the language rhythm according to the correlation to obtain variation synthesized voice;
wherein, the variation module includes:
the traversal submodule is used for traversing all the actual paragraphs and determining each actual paragraph as a current adjustment paragraph in sequence;
the variation paragraph obtaining sub-module is used for adjusting the language rhythm of the current adjustment paragraph to obtain a variation paragraph when the correlation between the current adjustment paragraph and the last actual paragraph is greater than a first preset threshold; the variation paragraph has a plurality of sentences;
the sentence correlation determination submodule is used for determining the sentence correlation of two adjacent sentences in the variation paragraph;
the sentence adjusting rhythm obtaining submodule is used for adjusting the language rhythm of the current sentence to obtain a sentence adjusting rhythm when the sentence correlation between the current sentence and the previous sentence is larger than a second preset threshold;
the variation synthesized speech generation submodule is used for generating variation synthesized speech by using the sentence adjustment rhythm of each sentence in all the actual paragraphs;
wherein, the variation paragraph acquisition submodule comprises:
a paragraph adjustment index obtaining unit, configured to obtain a first paragraph adjustment index of a current actual paragraph and a second paragraph adjustment index of a previous actual paragraph when a correlation between the current adjustment paragraph and the previous actual paragraph is greater than a first preset threshold; the first paragraph adjustment index is the speech speed, the tone and the volume of the current adjustment paragraph; the second paragraph adjustment index is the speech speed, the tone and the volume of the last actual paragraph;
a target paragraph adjustment index determining unit, configured to compare the first paragraph adjustment index and the second paragraph adjustment index, and determine, among the speech speed, the tone and the volume, the index with the greatest similarity between the current adjustment paragraph and the last actual paragraph as the target paragraph adjustment index;
and the variation paragraph acquisition unit is used for adjusting the language rhythm of the current adjusted paragraph based on the target paragraph adjustment index to obtain a variation paragraph.
5. The apparatus of claim 4, wherein the natural passage identification module comprises:
the line-change key identification submodule is used for acquiring a preprocessed text and identifying a line-change key in the preprocessed text;
and the preprocessed text splitting sub-module is used for splitting the preprocessed text into a plurality of natural paragraphs based on the line feed key.
6. The apparatus of claim 4, wherein the actual paragraph retrieving module comprises:
the first scene judgment submodule is used for judging whether each natural paragraph has only one scene;
a natural paragraph splitting submodule, configured to split the natural paragraph according to the scene to generate an actual paragraph if a natural paragraph has more than one scene;
the second scene judging submodule is used for judging whether two adjacent natural paragraphs are in the same scene if each natural paragraph has only one scene;
and the paragraph merging submodule is used for merging two adjacent natural paragraphs of the same scene into the same actual paragraph if the two adjacent natural paragraphs are in the same scene.
7. An electronic device, comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the variation method of synthesized speech of any one of claims 1-3 according to instructions in the program code.
8. A computer-readable storage medium characterized in that the computer-readable storage medium stores a program code for executing the variation method of synthesized speech according to any one of claims 1 to 3.
CN202210707967.8A 2022-06-22 2022-06-22 Variation method and device for synthetic voice, electronic equipment and storage medium Active CN114783402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210707967.8A CN114783402B (en) 2022-06-22 2022-06-22 Variation method and device for synthetic voice, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210707967.8A CN114783402B (en) 2022-06-22 2022-06-22 Variation method and device for synthetic voice, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114783402A CN114783402A (en) 2022-07-22
CN114783402B true CN114783402B (en) 2022-09-13

Family

ID=82421438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210707967.8A Active CN114783402B (en) 2022-06-22 2022-06-22 Variation method and device for synthetic voice, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114783402B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615524A (en) * 2018-05-14 2018-10-02 平安科技(深圳)有限公司 A kind of phoneme synthesizing method, system and terminal device
CN114363691A (en) * 2021-04-22 2022-04-15 南京亿铭科技有限公司 Speech subtitle synthesis method, apparatus, computer device, and storage medium
CN114373444B (en) * 2022-03-23 2022-05-27 广东电网有限责任公司佛山供电局 Method, system and equipment for synthesizing voice based on montage

Also Published As

Publication number Publication date
CN114783402A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
US11908451B2 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
KR20150146373A (en) Method and apparatus for speech synthesis based on large corpus
CN113628610B (en) Voice synthesis method and device and electronic equipment
CN110782869A (en) Speech synthesis method, apparatus, system and storage medium
CN112216267B (en) Prosody prediction method, device, equipment and storage medium
CN112270933A (en) Audio identification method and device
GB2603776A (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN114783402B (en) Variation method and device for synthetic voice, electronic equipment and storage medium
CN108922505B (en) Information processing method and device
CN111785236A (en) Automatic composition method based on motivational extraction model and neural network
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN115206270A (en) Training method and training device of music generation model based on cyclic feature extraction
CN117769739A (en) System and method for assisted translation and lip matching of dubbing
CN113838445B (en) Song creation method and related equipment
CN114078464B (en) Audio processing method, device and equipment
CN113379875B (en) Cartoon character animation generation method, device, equipment and storage medium
US20240205520A1 (en) Method for coherent, unsupervised, transcript-based, extractive summarisation of long videos of spoken content
CN115457931B (en) Speech synthesis method, device, equipment and storage medium
CN112185338B (en) Audio processing method, device, readable storage medium and electronic equipment
CN111681679B (en) Video object sound effect searching and matching method, system, device and readable storage medium
Noufi et al. Unsupervised representation learning for context of vocal music
Wei et al. A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis
Wilhelms-Tricarico et al. The lessac technologies hybrid concatenated system for blizzard challenge 2013

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant