CN114373444B - Method, system and equipment for synthesizing voice based on montage - Google Patents
- Publication number
- CN114373444B CN114373444B CN202210285222.7A CN202210285222A CN114373444B CN 114373444 B CN114373444 B CN 114373444B CN 202210285222 A CN202210285222 A CN 202210285222A CN 114373444 B CN114373444 B CN 114373444B
- Authority
- CN
- China
- Prior art keywords
- paragraphs
- text
- processed
- tone
- proportion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/137—Hierarchical processing, e.g. outlines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The application discloses a method, a system, and a device for montage-based speech synthesis. The method comprises the following steps: after paragraph segmentation preprocessing is performed on the existing natural paragraphs of a text to be processed, the text is divided into a plurality of actual paragraphs based on scene type and emotion level type; the correlation between the scenes and emotion levels of adjacent paragraphs among the actual paragraphs is calculated; after the intonation parameters of the text are set, the intonation change ratio and the intonation change direction of the text are calculated according to the correlation; and paragraph speech synthesis is performed on the text according to the intonation change ratio and the intonation change direction. This solves the technical problem that speech synthesized in the prior art sounds stiff.
Description
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a method, a system, and an apparatus for synthesizing speech based on montage.
Background
Montage generally refers to scene transitions in film: through the splitting and reassembly of shots, scenes, and passages, material is selected or discarded so that the expressed content becomes clear and focused, achieving a high degree of summarization and concentration.
Disclosure of Invention
The application provides a method, a system, and a device for montage-based speech synthesis, which address the technical problem that speech synthesized in the prior art sounds stiff.
In view of the above, the present application provides, in a first aspect, a montage-based speech synthesis method, including:
after paragraph segmentation preprocessing is performed on the existing natural paragraphs of a text to be processed, dividing the text to be processed into a plurality of actual paragraphs based on scene type and emotion level type;
calculating the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs;
after the intonation parameters of the text to be processed are set, calculating the intonation change ratio and the intonation change direction of the text to be processed according to the correlation;
and performing paragraph speech synthesis on the text to be processed according to the intonation change ratio and the intonation change direction.
Optionally, the paragraph segmentation preprocessing of the existing natural paragraphs of the text to be processed specifically includes: dividing the text to be processed into paragraphs at the line-break characters it already contains.
Optionally, the dividing of the text to be processed into a plurality of actual paragraphs based on scene type and emotion level type specifically includes:
merging different paragraphs that have the same scene type and the same emotion level type into a single paragraph, and correspondingly splitting a paragraph whose sub-passages have different scene types or different emotion level types into a plurality of paragraphs.
Optionally, the calculating of the correlation between the scenes and emotion levels of adjacent paragraphs among the actual paragraphs specifically includes:
manually labelling scenes and emotion levels on the text to be processed and then performing correlation training to obtain a correlation calculation model, and calculating the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs based on the correlation calculation model.
Optionally, after the intonation parameters of the text to be processed are set, calculating the intonation change ratio and the intonation change direction of the text to be processed according to the correlation specifically includes:
setting the ratio range of the total intonation change of the text to be processed and the upper and lower limits of the reference intonation and the starting intonation; calculating the intonation change ratio of adjacent paragraphs from the ratio of the total intonation change to the correlation; and determining the rise or fall of the intonation of adjacent paragraphs as the intonation change direction, thereby obtaining the intonation change ratio and the intonation change direction of the text to be processed.
A second aspect of the application provides a montage-based speech synthesis system, the system comprising:
the dividing unit is configured to divide the text to be processed into a plurality of actual paragraphs based on scene type and emotion level type after paragraph segmentation preprocessing is performed on its existing natural paragraphs;
the first calculation unit is configured to calculate the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs;
the second calculation unit is configured to calculate the intonation change ratio and the intonation change direction of the text to be processed according to the correlation, after the intonation parameters of the text are set;
and the synthesis unit is configured to perform paragraph speech synthesis on the text to be processed according to the intonation change ratio and the intonation change direction.
Optionally, the dividing unit is specifically configured to:
dividing the existing natural paragraphs of the text to be processed into paragraphs at line-break characters;
merging different paragraphs that have the same scene type and the same emotion level type into a single paragraph, and correspondingly splitting a paragraph whose sub-passages have different scene types or different emotion level types into a plurality of paragraphs.
Optionally, the first computing unit is specifically configured to:
manually labelling scenes and emotion levels on the text to be processed and then performing correlation training to obtain a correlation calculation model, and calculating the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs based on the correlation calculation model.
Optionally, the second computing unit is specifically configured to:
setting the ratio range of the total intonation change of the text to be processed and the upper and lower limits of the reference intonation and the starting intonation; calculating the intonation change ratio of adjacent paragraphs from the ratio of the total intonation change to the correlation; and determining the rise or fall of the intonation of adjacent paragraphs as the intonation change direction, thereby obtaining the intonation change ratio and the intonation change direction of the text to be processed.
A third aspect of the application provides a montage-based speech synthesis apparatus, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is adapted to perform the steps of the montage-based speech synthesis method according to the first aspect as described above, according to instructions in the program code.
According to the technical scheme, the method has the following advantages:
the application provides a method for synthesizing voice based on montage, which comprises the following steps: after paragraph segmentation preprocessing is carried out on existing natural paragraphs of a text to be processed, the text to be processed is divided into a plurality of actual paragraphs based on a scene type and an emotion level type; calculating the relevance of scenes and emotion hierarchies of adjacent paragraphs in a plurality of actual paragraphs; after the intonation parameters of the text to be processed are set, the intonation change proportion and the intonation change direction of the text to be processed are calculated according to the correlation; and performing paragraph speech synthesis on the text to be processed according to the tone change proportion and the tone change direction. Compared with the prior art, the method comprises the steps of firstly dividing a text to be processed according to a scene and emotion levels to obtain paragraphs according with an actual scene and emotion, then calculating the correlation between adjacent paragraphs, determining the parameters such as the starting tone of the paragraphs and the reference tone of the actual paragraphs based on the correlation, and thus obtaining the tone change proportion and the tone change direction of the text to be processed, and finally carrying out paragraph speech synthesis according to the determined tone change proportion and the tone change direction, so that the speech synthesis is more vivid and accords with the auditory habits of people. Therefore, the technical problem that the speech synthesis sounds very hard in the prior art is solved.
Drawings
Fig. 1 is a schematic flowchart of an embodiment of a montage-based speech synthesis method provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an embodiment of a montage-based speech synthesis system provided in the embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, an embodiment of a method for synthesizing a montage-based speech according to the present application includes:
it should be noted that, in this embodiment, first, a line feed key is used to perform paragraph division processing on existing natural paragraphs of a text to be processed, then different paragraphs with the same scene type and the same emotion level type are merged into the same paragraph, and sub-paragraphs with different scene types and different emotion level types in the same paragraph are correspondingly divided into a plurality of paragraphs. It is understood that, for example: 1) although the text to be processed is two paragraphs, the two paragraphs are combined into one paragraph if the text is in the same scene and the same layer; 2) although the text is a paragraph, the text refers to a plurality of scenes and a plurality of emotion hierarchies, but the text is divided into different paragraphs according to the types of the scenes and the types of the emotion hierarchies.
Step 102: calculate the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs.
It should be noted that, in this embodiment, scenes and emotion levels are first labelled on the text manually, and correlation training is performed to obtain a correlation calculation model; the correlation between the scenes and emotion levels of adjacent actual paragraphs is then calculated with this model. A large amount of correlation training on manually labelled data is required. For example, the scene correlation between paragraphs A and B might be K = 50%; the correlation ranges from 0 to 100%.
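The patent does not specify the form of the trained correlation model. As a loudly hedged placeholder, the score for a pair of adjacent paragraphs could be derived from their labels alone, with an optional learned scene-pair similarity table standing in for the trained model:

```python
def scene_emotion_correlation(p1, p2, scene_sim=None):
    """Toy correlation in [0, 1] between adjacent paragraphs.
    p1, p2: dicts with 'scene' and 'emotion' labels. In the patent this score
    comes from a model trained on manually labelled text; the 0.5 weights and
    the fallback values here are illustrative assumptions."""
    scene_sim = scene_sim or {}  # hypothetical learned scene-pair similarities
    s = 1.0 if p1["scene"] == p2["scene"] else scene_sim.get(
        (p1["scene"], p2["scene"]), 0.0)
    e = 1.0 if p1["emotion"] == p2["emotion"] else 0.5
    return 0.5 * (s + e)
```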
Step 103: after the intonation parameters of the text to be processed are set, calculate the intonation change ratio and the intonation change direction of the text according to the correlation.
it should be noted that, in this embodiment, a ratio range of a total tone change of the text to be processed, an upper limit and a lower limit of a reference tone and a tone starting limit are set, a tone change ratio of adjacent paragraphs is calculated, and a ratio of the total tone change to the correlation and a rise and a fall of a tone of the adjacent paragraphs are calculated and used as a tone change direction, so that the tone change ratio and the tone change direction of the text to be processed are obtained.
The method comprises the following specific steps:
1) Determine the ratio range of the total intonation change, {0% to H%} (typically H is at most 50); determine the upper limit JDH and lower limit JDL of the reference intonation JD, and the upper limit QDH and lower limit QDL of the starting intonation QD.
2) Determine the ratio of the total intonation change to the correlation: R = H/100.
3) Obtain, from the trained correlation model, the scene correlation Kn of each paragraph with respect to the preceding one.
4) Determine the intonation change ratio between the current paragraph and the previous paragraph: Vn = R × Kn.
5) Determine whether the paragraph's intonation rises or falls relative to the previous paragraph:
a. If, after the rise or fall, JD remains within [JDL, JDH] and QD remains within [QDL, QDH], the direction of the change is chosen at random.
b. If, after the rise or fall, JD would fall outside [JDL, JDH] or QD outside [QDL, QDH], the direction of the change is reversed (if a rise was intended but would exceed the range, the intonation is lowered instead).
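The five steps above can be sketched as follows. The symbol names (H, R, Kn, Vn, JD, JDL, JDH) follow the patent; the numeric defaults are illustrative, and the analogous bound on the starting intonation QD is omitted for brevity:

```python
import random

def intonation_plan(correlations, H=50.0, JD=100.0, JDL=60.0, JDH=120.0,
                    rng=random):
    """Per-paragraph intonation change ratio Vn = R * Kn and a rise/fall
    direction that keeps the reference intonation JD within [JDL, JDH].
    The patent bounds the starting intonation QD the same way (not shown)."""
    R = H / 100.0                         # step 2: ratio of total change
    plan = []
    for Kn in correlations:               # step 3: Kn from the correlation model
        Vn = R * Kn                       # step 4: change ratio vs. previous paragraph
        direction = rng.choice((+1, -1))  # step 5a: random rise or fall
        if not (JDL <= JD * (1 + direction * Vn) <= JDH):
            direction = -direction        # step 5b: reverse if out of range
        JD = JD * (1 + direction * Vn)    # carry the reference intonation forward
        plan.append((Vn, direction))
    return plan
```

Passing a deterministic `rng` makes the reversal rule visible: starting from JD = 100 with Vn = 0.25, a forced rise to 125 exceeds JDH = 120, so the direction flips to a fall.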
Step 104: perform paragraph speech synthesis on the text to be processed according to the intonation change ratio and the intonation change direction.
Finally, paragraph speech synthesis is performed according to the intonation change ratio and intonation change direction determined in step 103.
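Taken together, steps 101 to 104 form a pipeline that might be wired up as below; every callable here is a hypothetical stand-in, not an API defined by the patent:

```python
def montage_speech_synthesis(text, segment, correlate, plan, synthesize):
    """End-to-end sketch of steps 101-104. segment re-paragraphs the text
    (step 101), correlate scores adjacent paragraphs (step 102), plan maps
    those scores to (ratio, direction) pairs (step 103), and synthesize
    renders one paragraph of audio (step 104)."""
    paras = segment(text)                                       # step 101
    ks = [correlate(a, b) for a, b in zip(paras, paras[1:])]    # step 102
    changes = plan(ks)                                          # step 103
    audio = [synthesize(paras[0], 0.0, +1)]                     # first paragraph at base intonation
    audio += [synthesize(p, v, d) for p, (v, d) in zip(paras[1:], changes)]
    return audio                                                # step 104
```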
In the method embodiment above, the text to be processed is first divided according to scene and emotion level to obtain paragraphs that match the actual scenes and emotions; the correlation between adjacent paragraphs is then calculated, and parameters such as each paragraph's starting intonation and reference intonation are determined based on that correlation, yielding the intonation change ratio and the intonation change direction of the text; finally, paragraph speech synthesis is performed according to the determined ratio and direction, so that the synthesized speech is more vivid and better matches human listening habits. The technical problem that prior-art speech synthesis sounds stiff is thereby solved.
The foregoing is an embodiment of a method for synthesizing a voice based on a montage provided in the embodiment of the present application, and the following is an embodiment of a system for synthesizing a voice based on a montage provided in the embodiment of the present application.
Referring to fig. 2, an embodiment of a montage-based speech synthesis system provided in an embodiment of the present application includes:
the dividing unit 201 is configured to divide a text to be processed into a plurality of actual paragraphs based on a scene type and an emotion level type after performing paragraph segmentation preprocessing on existing natural paragraphs of the text to be processed;
a first calculating unit 202, configured to calculate correlations between scenes and emotion hierarchies of adjacent paragraphs in a plurality of actual paragraphs;
the second calculating unit 203 is configured to calculate a tone change ratio and a tone change direction of the text to be processed according to the correlation after setting the tone parameter of the text to be processed;
and the synthesis unit 204 is configured to perform paragraph speech synthesis on the text to be processed according to the tone variation ratio and the tone variation direction.
In this embodiment, the montage-based speech synthesis system divides the text to be processed according to scene and emotion level to obtain paragraphs that match the actual scenes and emotions, calculates the correlation between adjacent paragraphs, determines parameters such as each paragraph's starting intonation and reference intonation based on that correlation to obtain the intonation change ratio and the intonation change direction of the text, and finally performs paragraph speech synthesis according to the determined ratio and direction, making the synthesized speech more vivid and better matched to human listening habits. The technical problem that prior-art speech synthesis sounds stiff is thereby solved.
Further, an embodiment of the present application further provides a montage-based speech synthesis device, where the device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the montage-based speech synthesis method of the above method embodiment according to the instructions in the program code.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, the unit and the device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.
Claims (10)
1. A method for synthesizing a voice based on a montage is characterized by comprising the following steps:
after paragraph segmentation preprocessing is performed on the existing natural paragraphs of a text to be processed, dividing the text to be processed into a plurality of actual paragraphs based on scene type and emotion level type;
calculating the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs;
after the intonation parameters of the text to be processed are set, calculating the intonation change ratio and the intonation change direction of the text to be processed according to the correlation;
and performing paragraph speech synthesis on the text to be processed according to the intonation change ratio and the intonation change direction.
2. The montage-based speech synthesis method according to claim 1, wherein the paragraph segmentation preprocessing of the existing natural paragraphs of the text to be processed specifically comprises: dividing the text to be processed into paragraphs at the line-break characters it contains.
3. The montage-based speech synthesis method according to claim 1, wherein the dividing of the text to be processed into a plurality of actual paragraphs based on the scene type and the emotion level type specifically comprises:
merging different paragraphs that have the same scene type and the same emotion level type into a single paragraph, and correspondingly splitting a paragraph whose sub-passages have different scene types or different emotion level types into a plurality of paragraphs.
4. The montage-based speech synthesis method according to claim 1, wherein the calculating of the correlation between the scene and the emotion level of adjacent paragraphs in the plurality of actual paragraphs specifically comprises:
manually labelling scenes and emotion levels on the text to be processed and then performing correlation training to obtain a correlation calculation model, and calculating the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs based on the correlation calculation model.
5. The montage-based speech synthesis method according to claim 1, wherein after the intonation parameters of the text to be processed are set, the intonation change proportion and the intonation change direction of the text to be processed are calculated according to the correlation, and the method specifically comprises the following steps:
setting the ratio range of the total intonation change of the text to be processed and the upper and lower limits of the reference intonation and the starting intonation; calculating the intonation change ratio of adjacent paragraphs from the ratio of the total intonation change to the correlation; and determining the rise or fall of the intonation of adjacent paragraphs as the intonation change direction, thereby obtaining the intonation change ratio and the intonation change direction of the text to be processed.
6. A montage-based speech synthesis system, comprising:
the dividing unit is configured to divide the text to be processed into a plurality of actual paragraphs based on scene type and emotion level type after paragraph segmentation preprocessing is performed on its existing natural paragraphs;
the first calculation unit is configured to calculate the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs;
the second calculation unit is configured to calculate the intonation change ratio and the intonation change direction of the text to be processed according to the correlation, after the intonation parameters of the text are set;
and the synthesis unit is configured to perform paragraph speech synthesis on the text to be processed according to the intonation change ratio and the intonation change direction.
7. The montage-based speech synthesis system according to claim 6, wherein the partitioning unit is specifically configured to:
dividing the existing natural paragraphs of the text to be processed into paragraphs at line-break characters;
merging different paragraphs that have the same scene type and the same emotion level type into a single paragraph, and correspondingly splitting a paragraph whose sub-passages have different scene types or different emotion level types into a plurality of paragraphs.
8. The montage-based speech synthesis system of claim 6, wherein the first computing unit is specifically configured to:
manually labelling scenes and emotion levels on the text to be processed and then performing correlation training to obtain a correlation calculation model, and calculating the correlation between the scenes and emotion levels of adjacent paragraphs among the plurality of actual paragraphs based on the correlation calculation model.
9. The montage-based speech synthesis system of claim 6, wherein the second computing unit is specifically configured to:
setting the ratio range of the total intonation change of the text to be processed and the upper and lower limits of the reference intonation and the starting intonation; calculating the intonation change ratio of adjacent paragraphs from the ratio of the total intonation change to the correlation; and determining the rise or fall of the intonation of adjacent paragraphs as the intonation change direction, thereby obtaining the intonation change ratio and the intonation change direction of the text to be processed.
10. A montage-based speech synthesis apparatus, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the montage-based speech synthesis method of any of claims 1-5 according to instructions in the program code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210285222.7A CN114373444B (en) | 2022-03-23 | 2022-03-23 | Method, system and equipment for synthesizing voice based on montage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210285222.7A CN114373444B (en) | 2022-03-23 | 2022-03-23 | Method, system and equipment for synthesizing voice based on montage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114373444A CN114373444A (en) | 2022-04-19 |
CN114373444B (en) | 2022-05-27
Family
ID=81146954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210285222.7A Active CN114373444B (en) | 2022-03-23 | 2022-03-23 | Method, system and equipment for synthesizing voice based on montage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114373444B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114678006B (en) * | 2022-05-30 | 2022-08-23 | 广东电网有限责任公司佛山供电局 | Rhythm-based voice synthesis method and system |
CN114783402B (en) * | 2022-06-22 | 2022-09-13 | 广东电网有限责任公司佛山供电局 | Variation method and device for synthetic voice, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211563A (en) * | 2019-06-19 | 2019-09-06 | 平安科技(深圳)有限公司 | Chinese speech synthesis method, apparatus and storage medium towards scene and emotion |
CN111243571A (en) * | 2020-01-14 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111292715A (en) * | 2020-02-03 | 2020-06-16 | 北京奇艺世纪科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
CN111681641A (en) * | 2020-05-26 | 2020-09-18 | 微软技术许可有限责任公司 | Phrase-based end-to-end text-to-speech (TTS) synthesis |
WO2021060591A1 (en) * | 2019-09-26 | 2021-04-01 | 미디어젠 주식회사 | Device for changing speech synthesis models according to character utterance contexts |
2022
- 2022-03-23 CN CN202210285222.7A patent/CN114373444B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN114373444A (en) | 2022-04-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114373444B (en) | Method, system and equipment for synthesizing voice based on montage | |
CN110955786B (en) | Dance action data generation method and device | |
CN109088995B (en) | Method and mobile phone for supporting global language translation | |
CN106448630B (en) | Method and device for generating digital music score file of song | |
CN110889381A (en) | Face changing method and device, electronic equipment and storage medium | |
CN106375780B (en) | A kind of multimedia file producting method and its equipment | |
CN110264993B (en) | Speech synthesis method, device, equipment and computer readable storage medium | |
CN110427809A (en) | Lip reading recognition methods, device, electronic equipment and medium based on deep learning | |
CN111696029A (en) | Virtual image video generation method and device, computer equipment and storage medium | |
CN114255187A (en) | Multi-level and multi-level image optimization method and system based on big data platform | |
CN108764114B (en) | Signal identification method and device, storage medium and terminal thereof | |
CN108170676A (en) | Method, system and the terminal of story creation | |
CN110297897B (en) | Question-answer processing method and related product | |
CN113327576A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN117152308B (en) | Virtual person action expression optimization method and system | |
CN116612759A (en) | Speech recognition method and storage medium | |
CN115333879B (en) | Remote conference method and system | |
CN111785236A (en) | Automatic composition method based on motivational extraction model and neural network | |
CN108717851A (en) | A kind of audio recognition method and device | |
CN110298903B (en) | Curve editing method and device, computing equipment and storage medium | |
CN113312902A (en) | Intelligent auditing and checking method and device for same text | |
CN112381151A (en) | Similar video determination method and device | |
CN113345411B (en) | Sound changing method, device, equipment and storage medium | |
CN117348736B (en) | Digital interaction method, system and medium based on artificial intelligence | |
CN110312040B (en) | Information processing method, device and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |