CN116913245A - Speech synthesis method, device, electronic equipment and storage medium


Info

Publication number
CN116913245A
Authority
CN
China
Prior art keywords
character
attribute
role
segment
scene
Prior art date
Legal status
Pending
Application number
CN202311041784.8A
Other languages
Chinese (zh)
Inventor
王玮
宋乾标
程旭
周旸旻
李全
Current Assignee
Anhui Tingjian Technology Co ltd
Original Assignee
Anhui Tingjian Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Anhui Tingjian Technology Co ltd
Priority claimed from application CN202311041784.8A
Publication of CN116913245A
Status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a speech synthesis method, an apparatus, an electronic device and a storage medium. The method comprises the following steps: extracting the character attributes of the characters contained in each segment of a target text; matching each character attribute against preset character attributes, and marking the character attribute based on the attribute matching result to obtain the character scene of the character corresponding to that attribute; and synthesizing and splicing the segment speech of each segment based on the character scenes of the characters it contains, to obtain the synthesized speech corresponding to the target text. The method, apparatus, electronic device and storage medium provided by the invention can follow the way characters change as the plot advances, so that the synthesized speech fits the plot more closely and offers more varied styles, giving the user a better audiobook-listening experience.

Description

Speech synthesis method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for speech synthesis, an electronic device, and a storage medium.
Background
A book listening (audiobook) system converts the content of books into speech through speech synthesis technology and plays it back to the user. In such a system the user can select a book of interest, for example a novel, and have it played in spoken form. Through an audiobook system, users can conveniently listen to the storylines, knowledge and other content of various books, saving time and effort and making reading more free and relaxed.
However, existing audiobook systems generally play everything in a single, relatively flat tone, with no obvious difference between the voice styles of different characters. As a result, when the user listens with such a system, the scenes of the storyline are poorly rendered and the style is monotonous, which harms the user experience. How to immerse the user in the story being listened to, giving it a stronger sense of imagery and immersion, has become a difficult problem in improving the user experience.
Disclosure of Invention
The invention provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium, which are intended to overcome the defect of the prior art that synthesized speech has a single style and degrades the user experience.
The invention provides a voice synthesis method, which comprises the following steps:
extracting character attributes of characters contained in each segment in the target text;
matching the character attribute with a preset character attribute, and marking the character attribute according to an attribute matching result to obtain a character scene of a character corresponding to the character attribute, wherein the character scene is a scene of the character in the fragment, and the same character corresponds to at least one scene in a plurality of fragments;
And synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
According to the voice synthesis method provided by the invention, the character scene of the character corresponding to the character attribute is obtained by marking the character attribute based on the attribute matching result, which comprises the following steps:
under the condition that the attribute matching result is successful, determining a role scene to which the role attribute belongs based on the matching degree between the role attribute and the preset role attribute which is successful in matching and the role scene to which the preset role attribute which is successful in matching belongs;
and under the condition that the attribute matching result is that the matching fails, creating a role and a role scene based on the role attribute, and taking the role attribute as a preset role attribute of the created role.
According to the voice synthesis method provided by the invention, the determining of the character scene to which the character attribute belongs based on the matching degree between the character attribute and the successfully matched preset character attribute and the character scene to which the successfully matched preset character attribute belongs comprises the following steps:
Creating a role scene to which the role attribute belongs based on the role attribute under the condition that the matching degree is smaller than a preset threshold value, and taking the role attribute as a preset role attribute in the created role scene;
and under the condition that the matching degree is greater than or equal to the preset threshold value, determining the character scene to which the successfully matched preset character attribute belongs as the character scene to which the character attribute belongs.
According to the voice synthesis method provided by the invention, when the matching degree is greater than or equal to the preset threshold value, the voice synthesis method further comprises the following steps:
and updating the preset role attribute successfully matched based on the role attribute.
According to the voice synthesis method provided by the invention, the synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text comprises the following steps:
determining a speaker template of each segment based on the role scene of the role contained in each segment;
and synthesizing and splicing the fragment voice of each fragment based on the speaker template to obtain the synthesized voice corresponding to the target text.
According to the voice synthesis method provided by the invention, the speaker template of each segment is determined based on the role scene of the role contained in each segment, and the voice synthesis method comprises the following steps:
clustering character attributes of the same character in the same character scene to obtain at least one character attribute cluster, and determining a speaker template associated with each character attribute cluster;
and determining the speaker template associated with the character attribute cluster of the character attribute of the character contained in each segment as the speaker template of each segment.
According to the voice synthesis method provided by the invention, the extracting of the character attribute of the character contained in each segment in the target text comprises the following steps:
and extracting the character attribute of any segment based on the character attribute of the character contained in the segment arranged in front of any segment in the target text.
According to the voice synthesis method provided by the invention, the matching of the character attribute with the preset character attribute comprises the following steps:
and respectively matching various attributes in the character attributes with various attributes in the preset character attributes, and determining the attribute matching result based on preset weights of the various attributes and the matching result of the various attributes.
The invention also provides a voice synthesis device, comprising:
the attribute extraction unit is used for extracting the character attribute of the character contained in each fragment in the target text;
the scene determining unit is used for matching the character attribute with a preset character attribute, and marking the character attribute according to an attribute matching result to obtain a character scene of a character corresponding to the character attribute, wherein the character scene is a scene of the character in the fragment, and the same character corresponds to at least one scene in a plurality of fragments;
and the voice synthesis unit is used for synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any of the above-mentioned speech synthesis methods when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech synthesis method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method of speech synthesis as described in any of the above.
According to the speech synthesis method, apparatus, electronic device and storage medium provided by the invention, the character attributes of the characters contained in each segment of the target text are extracted, each character attribute is matched against preset character attributes, and the character scene of the corresponding character is obtained based on the attribute matching result, so that the segment speech of each segment is synthesized and spliced based on the character scenes of the characters it contains. In this way, the attributes and scenes of different characters are reflected in the speech of each segment, and the synthesis follows the way characters change as the plot advances, so that the synthesized speech fits the plot more closely and offers more varied styles, improving scene rendering and giving the user a better audiobook-listening experience.
Drawings
In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a speech synthesis method according to the present invention;
FIG. 2 is a flow chart of step 130 in the speech synthesis method provided by the present invention;
FIG. 3 is a flow chart of step 131 in the speech synthesis method provided by the present invention;
FIG. 4 is a second flow chart of the speech synthesis method according to the present invention;
FIG. 5 is a schematic diagram of a speech synthesis apparatus according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Speech synthesis is a technology that converts text information into sound. With speech synthesis, an audiobook system can convert text into natural, fluent speech, so that users can enjoy and understand the storylines, knowledge and other content of a book anytime and anywhere, without having to read it.
Audiobook technology is now fairly mature and can convert text into speech according to its pronunciation. However, existing audiobook systems generally play everything in a single, relatively flat tone, with no obvious difference between the voice styles of different characters. As a result, when the user listens with such a system, the scenes of the storyline in the book are poorly rendered and the style is monotonous, which harms the user experience. How to immerse the user in the storyline, giving it a stronger sense of imagery and immersion, has become a difficult problem in improving the user experience.
In the prior art, the emotions, actions and the like of the characters in a story are acted out manually, with human voice actors dubbing the different scenes. This approach is not only very costly; as users' demands on audiobook systems keep growing, it is also difficult to meet those demands with speech resources recorded through manual dubbing. In view of this, the embodiments of the present invention provide a speech synthesis method to overcome the above drawbacks.
Fig. 1 is a schematic flow chart of a speech synthesis method according to the present invention, as shown in fig. 1, the method includes:
Step 110, extracting character attributes of characters contained in each segment in the target text;
Specifically, the target text is the text to be synthesized into speech. It may be text entered directly by the user, text uploaded by the user, or text downloaded from a network; the embodiment of the present invention does not limit this. After the target text is obtained, it can be divided to obtain the segments of the target text.
Here, the target text may be divided based on its own structural information, for example by using its table of contents to split it into volumes, chapters and sections; it may also be divided based on length, for example into segments of a preset number of words or a preset number of pages. After the division, a segment sequence corresponding to the target text is obtained; the sequence contains every segment of the target text, arranged in the order in which they appear in the text.
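As a small illustration of the length-based division (the table-of-contents approach would instead split on chapter headings), the sketch below cuts a text into segments of a preset number of words; the word count is an arbitrary choice made for the example, not a value prescribed by the invention.

```python
# Minimal sketch: divide a target text into a segment sequence by word count.
# Splitting on a table of contents / chapter headings would be an alternative.
def split_into_segments(text: str, words_per_segment: int = 500) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]

# Usage: segments = split_into_segments(target_text); order follows the text.
```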
After the segments of the target text are obtained, character attribute extraction can be performed on each segment to obtain the attributes of the characters it contains. Here, a character attribute is a description of a character's traits, personality, abilities, background and the like in the story; for example, character attributes may include the character's name, gender, age group, key personality traits, background, the era in which the character lives, accent, manner of speaking, and so on. The age group may be child, teenager, middle-aged, elderly, etc., and the background may include the character's family background, upbringing, education, professional experience, language background, etc.
In an embodiment, when extracting the attributes of the characters contained in each segment, any segment may be input into a pre-trained attribute extraction model to obtain the attributes of the characters it contains. The specific steps are as follows: perform semantic understanding and analysis on the text of the input segment, and compile statistics on all characters that appear in the segment; extract key elements from the segment based on the descriptive text and the contextual dialogue it contains, obtaining all character attributes; and associate these character attributes with the characters found in the statistics, obtaining the attributes of the characters contained in the segment.
It will be appreciated that the pre-trained attribute extraction model described above is a deep learning model for automatically extracting character attributes from text. It may be obtained by pre-training a deep neural network; for example, a convolutional neural network (CNN) or a long short-term memory network (LSTM) may be used to learn to extract representations of character attributes from text. Alternatively, a general-purpose pre-trained language model such as BERT (Bidirectional Encoder Representations from Transformers) may be used to learn text representations and then be fine-tuned to extract character attributes, yielding the pre-trained attribute extraction model.
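As a very rough stand-in for one of the options above, the sketch below uses an off-the-shelf token-classification (NER) pipeline from the Hugging Face transformers library to detect which character names appear in a segment. The checkpoint name is an assumption made for the example, and a real attribute extraction model would be fine-tuned to also output gender, age group, accent and the other attributes described here.

```python
# Minimal sketch, not the patent's actual model: detect character names in a
# segment with a generic pre-trained NER pipeline. The checkpoint below is an
# assumption; a real system would fine-tune its own attribute extractor.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",       # assumed English NER checkpoint
    aggregation_strategy="simple",
)

def characters_in_segment(segment_text: str) -> set[str]:
    """Return the set of person names mentioned in a text segment."""
    entities = ner(segment_text)
    return {e["word"] for e in entities if e["entity_group"] == "PER"}

# Usage sketch:
# characters_in_segment("Old Li sighed. 'You are late again,' he told Xiao Wang.")
```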
In another embodiment, step 110 specifically includes: extracting the character attributes of any segment based on the character attributes of the characters contained in the segments that precede it in the target text.
Here, the attributes of the characters contained in each segment can be extracted in sequence. When extracting the character attributes of a segment, the segment itself and the character attributes of the preceding segment can be input together into the pre-trained attribute extraction model to obtain the attributes of the characters contained in the segment. The specific steps are as follows: perform semantic understanding and analysis on the text of the input segment, and compile statistics on all characters that appear in it; extract key elements from the segment based on its descriptive text and contextual dialogue together with the character attributes of the preceding segment, obtaining all character attributes of the segment; and associate these attributes with the characters found in the statistics, obtaining the attributes of the characters contained in the segment.
In the embodiment of the invention, when extracting the attributes of the characters contained in each segment, the character attributes of the preceding segment are used as a reference for extracting those of the next segment. The influence of plot progression on character attributes is thus fully taken into account, so that the extracted attributes better match the plot, the synthesized speech better fits the story situation, and the user gets a better listening experience.
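To make the sequential extraction concrete, the following sketch shows only the control flow: segments are processed in order, and the attributes extracted from the previous segment are passed in as context for the next one. The extract_attributes function stands in for the pre-trained attribute extraction model and is purely an assumption.

```python
# Sketch of sequential, context-carrying attribute extraction.
# `extract_attributes` is a placeholder for the pre-trained attribute
# extraction model described above, not a real library call.
from dataclasses import dataclass, field

@dataclass
class CharacterAttributes:
    name: str
    gender: str = "unknown"
    age_group: str = "unknown"
    traits: list[str] = field(default_factory=list)
    accent: str = "unknown"

def extract_attributes(segment: str,
                       prior: dict[str, CharacterAttributes]) -> dict[str, CharacterAttributes]:
    """Placeholder: a real model would parse the segment and update `prior`."""
    return dict(prior)  # assumed model call

def extract_per_segment(segments: list[str]) -> list[dict[str, CharacterAttributes]]:
    results: list[dict[str, CharacterAttributes]] = []
    prior: dict[str, CharacterAttributes] = {}
    for seg in segments:
        attrs = extract_attributes(seg, prior)   # previous attributes as reference
        results.append(attrs)
        prior = attrs                            # carry forward to the next segment
    return results
```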
Step 120, matching the character attribute with a preset character attribute, and marking the character attribute according to the attribute matching result to obtain a character scene of the character corresponding to the character attribute, wherein the character scene is a scene of the character in the segment, and the same character corresponds to at least one scene in a plurality of segments;
Specifically, the preset character attributes are the attributes of all preset characters in a preset character library, which is a pre-built repository containing various preset characters. Preset characters fall into two categories: stock characters and modeled characters. Stock characters are system-preset archetypes, such as a male narrator, a generic male character, a generic female character, or preset characters representing a particular profession; modeled characters are characters created from the text segments, such as Protagonist No. 1.
After the attributes of the characters contained in each segment are extracted, each character's attributes can be matched against the preset attributes of every preset character in the preset character library to obtain an attribute matching result. If a match is found, the attribute matching result is a successful match; the extracted character attributes are then associated with the successfully matched preset character attributes, the character attributes are marked with the character for identification and use in later processing, and the character scene of the corresponding character is obtained. If no preset character attribute is matched, the attribute matching result is a failed match, and the character corresponding to the attributes can be determined to be a new character; a new modeled character and character scene are then created from the extracted attributes, the attributes are marked with that character to obtain its character scene, and the created modeled character can be stored in the preset character library for the next round of attribute matching.
It will be appreciated that a character scene is the scene a character belongs to within a segment, i.e., the situation in which the character appears, and the same character may have several character scenes. For example, a character may be struggling with something and repeatedly switch between a public persona and a private one, with a correspondingly different voice; two character scenes, a public one and a private one, can then be created for that character. By obtaining the character scene of the character corresponding to each character attribute, the attributes and scenes can be reflected in every synthesized segment speech, so that the synthesized speech better matches the story situation.
When matching a character attribute against preset character attributes, each type of attribute can be matched against the corresponding type in the preset character attribute, and the final attribute matching result is determined from the per-type matching results. For example, the similarity between each attribute type in the character attribute and the corresponding type in the preset character attribute can be computed with a similarity algorithm, and an appropriate threshold can be set to decide from the similarity whether the match succeeds. It should be appreciated that, when matching character attributes, the matching of names may take precedence over the matching of other attributes.
After the attribute matching result is obtained, the character attributes can be marked with a character based on the result: the mark of the successfully matched preset character, or of a newly created modeled character, is added to the corresponding text of the segment, associating the text contained in the segment with the corresponding character attributes and character scene. This makes clear which character delivers each sentence of the segment, and in which character scene, which is convenient for subsequent speech synthesis.
Step 130, synthesizing and splicing the segment speech of each segment based on the character scenes of the characters it contains, to obtain the synthesized speech corresponding to the target text.
Specifically, based on the obtained roles and role scenes contained in each segment, the segment voice of each segment can be synthesized, and then the segment voices of the segments are spliced and integrated according to the sequence, so that the synthesized voice corresponding to the target text can be obtained.
After the attributes of the characters contained in each segment are extracted, a voice role can be obtained by matching against a voice library based on the character attributes, such as the character's gender, age, manner of speaking and accent. Here, a voice role is an existing speaker, and the voice library is a collection of speech samples from a number of different speakers, providing the voice characteristics and sound samples of those speakers. It will be appreciated that the voice library may be built by recording a large number of speech samples, or by collecting data and extracting and generating speaker characteristics with speech analysis and processing techniques; the embodiment of the present invention does not limit this.
After the voice roles corresponding to the character attributes are obtained by matching, the text contained in each segment, the character attributes and character scenes associated with that text, and the voice roles corresponding to the characters can be input into a speech synthesis model to obtain the synthesized speech of each segment.
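As an illustration of the synthesis-and-splicing step, the sketch below assumes a synthesize_segment function that wraps whatever speech synthesis model is used (its interface is an assumption, not a real API) and simply concatenates the returned waveforms in segment order before writing them out.

```python
# Sketch: synthesize each segment, then splice the segment audio in order.
# `synthesize_segment` is an assumed wrapper around the speech synthesis
# model; only the concatenation/splicing logic is shown concretely.
import numpy as np
import soundfile as sf

SAMPLE_RATE = 24_000  # assumed output sample rate of the synthesis model

def synthesize_segment(text: str, annotations: dict) -> np.ndarray:
    """Assumed model call: returns a mono waveform for one segment."""
    return np.zeros(SAMPLE_RATE, dtype=np.float32)  # placeholder silence

def synthesize_book(segments: list[tuple[str, dict]], out_path: str) -> None:
    pieces = [synthesize_segment(text, ann) for text, ann in segments]
    audio = np.concatenate(pieces)          # splice segment speech in order
    sf.write(out_path, audio, SAMPLE_RATE)  # write the full synthesized speech
```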
Further, in order to improve the scene rendering of the synthesized speech and let the user understand the characters more deeply and feel their emotional changes, semantic understanding and emotion element extraction can be performed on the text of each segment before its segment speech is synthesized, marking data such as the emotion, emotion intensity and intonation of each sentence. The segment speech of each segment is then generated from the text of the segment, the character attributes, character scenes and emotion elements associated with that text, and the voice roles corresponding to the characters.
In addition, to help the user better understand the story, the target text usually contains narration that describes the storyline, the characters' states of mind and so on, conveying richer information; to obtain the complete synthesized speech of the target text, the narration must also be synthesized. When performing semantic understanding and analysis on the text of each segment, the narration text it contains can be identified and associated with the narrator character in the preset character library, so that in the subsequent speech synthesis the narration text is synthesized through the narrator character and its character attributes, yielding a more complete synthesized speech of the target text.
According to the method provided by the embodiment of the invention, the attributes of the characters contained in the target text are extracted and matched against preset character attributes, and the character scene of the corresponding character is obtained from the attribute matching result, so that the segment speech of each segment is synthesized and spliced based on the character scenes of the characters it contains. The attributes and scenes of different characters are thereby reflected in each segment's speech, and the synthesis follows the way characters change as the plot advances, so that the synthesized speech fits the plot more closely, the voice styles of different characters are more varied, scene rendering is improved, and the user gets a better listening experience.
Based on the above embodiment, in step 120, marking the character attribute with a character based on the attribute matching result to obtain the character scene of the character corresponding to the character attribute includes:
step 121, determining a role scene to which the character attribute belongs based on the matching degree between the character attribute and the preset character attribute successfully matched and the role scene to which the preset character attribute successfully matched belongs when the attribute matching result is that the matching is successful;
Specifically, a successful attribute matching result means that a preset character attribute has been matched in the preset character library. If the character corresponding to the successfully matched preset attribute is a stock character, the character can be marked directly, and the character scene to which that preset attribute belongs is taken as the character scene of the extracted attribute. If the character corresponding to the successfully matched preset attribute is a modeled character, the character scene of the character corresponding to the attribute can be determined based on the degree of match between the extracted attribute and the successfully matched preset attribute, together with the character scene to which that preset attribute belongs.
For modeled characters, when computing the degree of match between the character attribute and the successfully matched preset attribute, a similarity matching algorithm can be used to compare the existing preset attribute with the extracted attribute. If the attribute information is completely consistent, the character can be marked directly and the character scene of the successfully matched preset attribute is taken as the scene of the extracted attribute; if there are differences, whether to create a new character scene can be decided from the degree of match, thereby obtaining the character scene to which the attribute belongs.
It can be understood that when the degree of match is low, i.e., the deviation between the character attribute and the preset attribute is large, the character's attributes can be considered to have changed considerably, and a new character scene can be created from the attribute; when the degree of match is high, i.e., the deviation is small, the character scene of the successfully matched preset attribute can be taken as the scene of the extracted attribute.
Step 122, when the attribute matching result is a failed match, creating a character and a character scene based on the character attribute, and taking the character attribute as a preset attribute of the created character.
Specifically, a failed attribute matching result means that no preset character attribute has been matched in the preset character library; the character corresponding to the attribute can then be determined to be a new character, and a new modeled character and character scene can be created from the extracted attribute. When creating the character scene, semantic understanding can be applied to the text of the segment to extract the key elements related to the scene, and a corresponding character scene is created for the character by combining them with the extracted character attributes.
After the character and the character scene corresponding to the character attribute are created, the character attribute can be used as a preset character attribute of the created character in the character scene, and the created character and the preset character attribute are stored in a preset character library, so that the character attribute of other fragments can be matched conveniently.
According to the method provided by the embodiment of the invention, the character scene of the character corresponding to the character attribute is determined from the attribute matching result, so that a character scene better matching the plot can be obtained as the plot changes, making the synthesized speech more vivid and true to the character.
Based on the above embodiment, in step 121, determining the role scene to which the role attribute belongs based on the matching degree between the role attribute and the preset role attribute successfully matched and the role scene to which the preset role attribute successfully matched belongs, includes:
under the condition that the matching degree is smaller than a preset threshold value, creating a role scene to which the role attribute belongs based on the role attribute, and taking the role attribute as a preset role attribute in the created role scene;
and under the condition that the matching degree is greater than or equal to a preset threshold value, determining the character scene to which the successfully matched preset character attribute belongs as the character scene to which the character attribute belongs.
Specifically, the preset threshold is a predefined similarity threshold that can be set according to actual requirements; for example, it may be 60% or 70%. When the degree of match is below the preset threshold, the difference between the character attribute and the successfully matched preset attribute is large, and the character's attributes can be considered to have changed substantially; this is defined as a case of switching among several character scenes. A new character scene can therefore be created for the character based on the attribute and the text contained in the segment, and the created scene can be marked as the character's default scene, so that repeated switching of the same character among different character scenes can be handled quickly; at the same time, the attribute can be taken as the preset attribute under the created scene, to be matched against the character attributes of later segments.
When the degree of match is greater than or equal to the preset threshold, the difference between the character attribute and the successfully matched preset attribute is small, and the character scene to which the successfully matched preset attribute belongs can be used directly as the scene to which the character attribute belongs.
According to the method provided by the embodiment of the invention, the character scene of a character attribute is determined from the degree of match between the attribute and the successfully matched preset attribute, which makes it possible to distinguish character scenes and to switch the same character rapidly among different scenes.
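The decision logic of the two cases above, including the attribute update of the following embodiment, can be summarized in a few lines. The sketch below only illustrates the thresholding; the similarity measure and the character-library layout are plain-Python assumptions rather than the invention's actual data model.

```python
# Sketch of scene assignment by degree of match against a preset threshold.
# `similarity` and the library layout are illustrative assumptions.
MATCH_THRESHOLD = 0.7  # e.g. 70%, configurable as described above

def similarity(attrs: dict, preset_attrs: dict) -> float:
    """Toy similarity: fraction of attribute fields with identical values."""
    keys = set(attrs) | set(preset_attrs)
    same = sum(1 for k in keys if attrs.get(k) == preset_attrs.get(k))
    return same / len(keys) if keys else 0.0

def assign_scene(attrs: dict, matched_character: dict) -> str:
    """Return a scene name for `attrs`, creating a new character scene if needed."""
    best_scene, best_score = None, -1.0
    for scene_name, preset_attrs in matched_character["scenes"].items():
        score = similarity(attrs, preset_attrs)
        if score > best_score:
            best_scene, best_score = scene_name, score
    if best_scene is not None and best_score >= MATCH_THRESHOLD:
        matched_character["scenes"][best_scene].update(attrs)  # keep preset up to date
        return best_scene
    new_scene = f"scene_{len(matched_character['scenes']) + 1}"
    matched_character["scenes"][new_scene] = dict(attrs)       # attrs become the preset
    return new_scene
```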
Based on the above embodiment, in the case that the matching degree is greater than or equal to the preset threshold, the method further includes:
and updating the preset character attribute successfully matched based on the character attribute.
It should be noted that, since a character keeps evolving and changing as the plot advances, in order to keep the character attributes consistent with the plot, the embodiment of the invention updates the successfully matched preset character attribute with the extracted attribute, so that the preset attribute information of the character in the preset character library stays up to date.
Specifically, when the matching result is not completely consistent but the degree of match is greater than or equal to the preset threshold, the character can be considered to have changed somewhat as the plot progresses. The preset attribute of the modeled character already in the preset character library can then be updated with the extracted attribute and defined as the latest version of the modeled character, achieving the effect that the character keeps evolving as the plot advances.
Based on any of the above embodiments, fig. 2 is a schematic flow chart of step 130 in the speech synthesis method provided by the present invention, and as shown in fig. 2, step 130 specifically includes:
step 131, determining a speaker template of each segment based on the role scene of the role contained in each segment;
step 132, synthesizing and splicing the segment voice of each segment based on the speaker template to obtain the synthesized voice corresponding to the target text.
Specifically, before synthesizing the segment speech of each segment, the speaker template corresponding to each segment can be obtained by matching from a speaker template library based on the roles and the role scenes contained in each segment, where the speaker template library is a resource library for speech synthesis and can be used for defining the speech characteristics and styles of different roles.
The speaker template library may include: 1) predefined speakers created according to specific character or personality settings; for example, one character type may have specific speech characteristics such as slower speech and a lower pitch, while an elegant character may have fluent, graceful speech; 2) data generated from character attributes, for example a merchant who speaks with an accent but talks elegantly.
When matching is performed from the speaker template library, matching can be performed according to the character attribute and the character scene of the character to obtain speaker templates corresponding to each character attribute, wherein the speaker templates refer to voice characteristic data for specific speakers, which are obtained from the speaker template library in a matching mode.
For any segment, after the speaker template corresponding to each character attribute has been obtained by matching, all speaker templates corresponding to the segment can be determined; these templates, the text contained in the segment, and the emotion elements corresponding to that text are then input into the speech synthesis model to obtain the segment speech of that segment. The segment speech of all segments is then spliced and integrated in order, yielding the synthesized speech corresponding to the target text.
According to the method provided by the embodiment of the invention, the segment speech of each segment is synthesized and spliced based on speaker templates, so that every character in the synthesized speech has its own speaking style and voice characteristics; the synthesized speech is more vivid and truer to each character's identity, achieving a personalized and diversified speech synthesis effect.
Based on any of the above embodiments, fig. 3 is a schematic flow chart of step 131 in the speech synthesis method provided by the present invention, and as shown in fig. 3, step 131 specifically includes:
step 1311, clustering character attributes of the same character in the same character scene to obtain at least one character attribute cluster, and determining a speaker template associated with each character attribute cluster;
step 1312, determining the speaker template associated with the character attribute cluster to which the character attribute of the character included in each segment belongs as the speaker template of each segment.
It should be noted that a character's attributes change as the plot advances, and different characters change at different rates and frequencies depending on the storyline. Therefore, for any character, its attributes across the segments can be clustered; the number of attribute classes for that character and the speaker template for each class are then determined from the resulting attribute clusters, which helps make the matching and determination of speaker templates more efficient.
Specifically, for any character, its attributes can be clustered with a clustering algorithm (such as K-means): the attributes of the same character within the same character scene are clustered to obtain at least one attribute cluster, and the number of attribute classes for that character is given by the number of clusters. Then, for each attribute class of the character, the speaker template corresponding to each attribute cluster can be obtained by matching from the speaker template library.
After the speaker template corresponding to each attribute cluster has been obtained, the speaker template associated with the attribute cluster to which a character's attribute belongs can be determined as the speaker template of the corresponding segment. For example, if a segment contains Protagonist No. 1, Supporting Role No. 2 and the narrator, then, based on the speaker templates obtained for each attribute cluster, the template for Protagonist No. 1 may be determined to be template A, that for Supporting Role No. 2 template B, and that for the narrator template C, so the speaker templates corresponding to that segment include templates A, B and C.
According to the method provided by the embodiment of the invention, clustering the attributes of the same character within the same character scene makes it possible to determine each character's attribute clusters quickly and to associate a speaker template with each cluster for use during synthesis; this greatly improves the efficiency of speech synthesis, since matching against the speaker template library no longer has to be performed for every segment. In addition, the voice characteristics of the same character in different scenes can be distinguished, so that the synthesized speech better matches the character's identity and character scene.
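As a rough illustration of the clustering step, the sketch below vectorizes simple attribute dictionaries and clusters them with scikit-learn's K-means; the feature encoding, the cluster count and the template lookup are all assumptions made for the example.

```python
# Sketch: cluster one character's per-segment attributes (within one scene)
# and attach a speaker template to each cluster. Encoding and template
# selection are illustrative assumptions, not the invention's exact scheme.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

def cluster_attributes(attr_dicts: list[dict], n_clusters: int = 2) -> list[int]:
    """Return a cluster id for each attribute dictionary."""
    vec = DictVectorizer(sparse=False)
    features = vec.fit_transform(attr_dicts)  # one-hot / numeric encoding
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return list(km.fit_predict(features))

if __name__ == "__main__":
    attrs = [
        {"age_group": "young", "tone": "cheerful"},
        {"age_group": "young", "tone": "cheerful"},
        {"age_group": "young", "tone": "gloomy"},
    ]
    labels = cluster_attributes(attrs, n_clusters=2)
    templates = {cid: f"speaker_template_{cid}" for cid in set(labels)}  # assumed lookup
    print(labels, templates)
```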
Based on any of the above embodiments, in step 120, matching the character attribute with the preset character attribute includes:
and respectively matching various attributes in the character attributes with various attributes in the preset character attributes, and determining an attribute matching result based on preset weights of the various attributes and matching results of the various attributes.
Specifically, the preset weights are different matching weights assigned in advance to certain key character attributes, so that a more accurate attribute matching result is obtained when matching character attributes and a more suitable character is used in the subsequent speech synthesis. For example, the preset weight of the name may be 30%, that of the gender 30%, that of the age group 15%, that of the key personality traits 10%, and so on. The preset weights of the various attribute types can be set according to actual requirements and scenarios; the embodiment of the present invention does not limit them.
Each type of attribute in the extracted character attributes is matched against the corresponding type in the preset character attributes to obtain per-type matching results; a matching score is computed for each attribute type from its result, the scores are multiplied by the corresponding preset weights and summed into an overall score, and the final attribute matching result is determined from the overall score.
According to the method provided by the embodiment of the invention, when the character attribute is matched with the preset character attribute, the final attribute matching result is determined based on the preset weight of each attribute and the matching result of each attribute, so that the precision of the attribute matching result can be improved.
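A minimal sketch of this weighted scoring follows, assuming exact-match scores per attribute type and the example weights given above; a real implementation would use a proper similarity measure for each field.

```python
# Sketch: weighted attribute matching. The weights and the exact-match scoring
# are illustrative; the description allows any per-attribute similarity.
WEIGHTS = {"name": 0.30, "gender": 0.30, "age_group": 0.15, "traits": 0.10}

def attribute_match_score(attrs: dict, preset_attrs: dict) -> float:
    """Overall score = sum of per-attribute scores times their preset weights."""
    score = 0.0
    for key, weight in WEIGHTS.items():
        per_attr = 1.0 if attrs.get(key) == preset_attrs.get(key) else 0.0
        score += weight * per_attr
    return score

def is_match(attrs: dict, preset_attrs: dict, threshold: float = 0.6) -> bool:
    return attribute_match_score(attrs, preset_attrs) >= threshold
```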
Based on any of the above embodiments, fig. 4 is a second flow chart of a speech synthesis method according to the present invention, as shown in fig. 4, the method includes:
step 410, obtaining a target text, and dividing the target text to obtain a fragment sequence of the target text;
Here, the target text is divided based on its own structural information, for example by using its table of contents to split it into volumes, chapters and sections; it may also be divided based on length, for example into segments of a preset number of words or a preset number of pages.
After the division is completed, a segment sequence corresponding to the target text can be formed, wherein the segment sequence comprises a plurality of segments, and the segments in the segment sequence are arranged according to the sequence in the target text.
Step 420, extracting role attributes of each segment in the segment sequence;
In an embodiment, for the extraction of the character attribute in each segment, it may be performed separately for each segment: for one of the segments, when extracting the character attribute of the segment, the segment itself may be input into a pre-trained attribute extraction model to obtain the character attribute of the segment.
In another embodiment, considering that the attribute of the character changes along with the promotion of the scenario, and the attribute of the character formed later also necessarily changes based on the attribute of the character formed earlier, when extracting the attribute of the character of each segment, the following steps can be sequentially performed: for one of the segments, when the character attribute of the segment is extracted, the segment itself and the character attribute of the segment before the segment can be input into a pre-trained attribute extraction model together to obtain the character attribute of the segment.
Here, the character attribute may be obtained by semantic understanding analysis and key element extraction, and the specific steps are as follows:
Step 421, performing semantic understanding and analysis on the text of the input segment, and compiling statistics on all characters that appear in the segment;
Step 422, extracting key elements such as name, gender, age group and key personality traits from the segment, based on the descriptive text and contextual dialogue it contains, or on these together with the character attributes of the preceding segment, to obtain all character attributes of the segment, and associating them with the appearing characters obtained in step 421, thereby obtaining the attributes of the characters contained in the segment.
After the attributes of the characters contained in each segment are extracted, a voice role for subsequent speech synthesis can be obtained by matching against the voice library based on the character attributes, such as the character's gender, age, manner of speaking and accent.
Step 430, matching the character attribute with a preset character attribute, and marking the character attribute according to the attribute matching result to obtain a character scene of the character corresponding to the character attribute;
Step 431, if no preset character attribute is matched, the attribute matching result is a failed match; the character corresponding to the attribute is determined to be a new character, a new character and character scene are created from the attribute, and the character is marked, i.e., the text contained in the segment is associated with the corresponding character attribute and character scene for subsequent speech synthesis;
Step 432, if a preset character attribute is matched, the attribute matching result is a successful match, and a further determination is made based on the category of the character corresponding to that preset attribute;
Step 4321, if the character is a stock character, the character scene to which the character attribute belongs is determined from the stock character's character scene, and the character is marked directly;
Step 4322, if the character is an existing modeled character, the character scene to which the character attribute belongs is determined based on the degree of match between the character attribute and the preset character attribute.
Here, for a modeled character, a similarity matching algorithm can be used together with the preset weights of the various attribute types to compute the degree of match between the character attribute and the successfully matched preset attribute. If the degree of match is greater than or equal to the preset threshold, the character is considered to have changed as the plot progresses while its character scene has not changed significantly; the preset attribute is then updated with the character attribute, the character is marked, and the latest version of the modeled character is defined, achieving the effect that the character keeps evolving as the plot advances. If the degree of match is below the preset threshold, the character's attributes are considered to have changed substantially, which is defined as switching among several character scenes; a new character scene is created from the attribute and marked as the character's default scene, so that repeated switching of the same character among different scenes can be handled quickly, and the successfully matched preset attribute is updated at the same time.
Step 440, based on the role scene, clustering the role attributes of each segment, and determining the speaker template corresponding to each segment;
here, for any character, the character attributes of the character in each segment may be clustered, that is, the character attributes of the character in the same character scene are clustered to obtain at least one character attribute cluster, and the number of character attributes corresponding to the character is determined according to the number of the character attribute clusters obtained by clustering. Then, for each type of character attribute after the character clustering, the speaker templates corresponding to each character attribute cluster can be obtained by matching from a speaker template library, and the speaker templates corresponding to each segment can be obtained based on the speaker templates associated with the character attribute clusters to which the character attribute of the character contained in each segment belongs.
Step 450, synthesizing and splicing the segment speech of each segment based on the speaker templates to obtain the synthesized speech corresponding to the target text.
After the speaker templates corresponding to each segment are obtained, all the speaker templates corresponding to the segment and the text contained in the segment can be input into a speech synthesis model, so that the segment speech of the segment can be obtained. And then, splicing and integrating the segment voices of the segments according to the sequence, and obtaining the synthesized voice corresponding to the target text.
According to the method provided by the embodiment of the invention, character attribute extraction is performed on each segment, and the segment speech of each segment is generated from the character attributes and character scenes. The facts that characters change as the plot advances and that a character's voice attributes change with the plot are fully taken into account, so that speech synthesis can use the appropriate voice attributes under different plot situations; the synthesized speech is therefore more consistent with the characters and the plot, giving the user a better listening experience.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of a speech synthesis apparatus according to the present invention, as shown in fig. 5, where the apparatus includes:
an attribute extraction unit 510, configured to extract a character attribute of a character included in each segment in the target text;
the scene determining unit 520 is configured to match the character attribute with a preset character attribute, and perform character marking on the character attribute based on an attribute matching result to obtain a character scene of a character corresponding to the character attribute, where the character scene is a scene to which the character belongs in a segment, and the same character corresponds to at least one scene in a plurality of segments;
the speech synthesis unit 530 is configured to synthesize and splice the segment speech of each segment based on the role scene of the role included in each segment, so as to obtain the synthesized speech corresponding to the target text.
According to the device provided by the embodiment of the invention, the character attribute of the character contained in each segment of the target text is extracted, the character attribute is matched with the preset character attribute, and the character scene of the character corresponding to the character attribute is obtained based on the attribute matching result, so that the segment voice of each segment is synthesized and spliced based on the character scene of the character contained in that segment. In this way, the character attributes and character scenes of different characters are reflected in each segment voice, the characteristic that characters change as the plot advances can be followed, the synthesized voice fits the plot more closely, the voice styles of different characters are more diversified, the scene rendering effect is improved, and a better listening experience is brought to the user.
Based on any of the above embodiments, the scene determining unit 520 specifically includes:
the first determining subunit is used for determining the role scene to which the role attribute belongs based on the matching degree between the role attribute and the preset role attribute which is successfully matched and the role scene to which the preset role attribute which is successfully matched belongs when the attribute matching result is that the matching is successful;
and the second determining subunit is used for creating a role and a role scene based on the role attribute and taking the role attribute as a preset role attribute of the created role when the attribute matching result is that the matching fails.
Based on any of the above embodiments, the first determining subunit is specifically configured to:
under the condition that the matching degree is smaller than a preset threshold value, creating a role scene to which the role attribute belongs based on the role attribute, and taking the role attribute as a preset role attribute in the created role scene;
and under the condition that the matching degree is greater than or equal to a preset threshold value, determining the character scene to which the successfully matched preset character attribute belongs as the character scene to which the character attribute belongs.
Based on any of the above embodiments, the first determining subunit is further configured to:
and updating the successfully matched preset role attribute based on the role attribute under the condition that the matching degree is greater than or equal to the preset threshold value.
Based on any of the above embodiments, the speech synthesis unit 530 specifically includes:
a template determination subunit, configured to determine a speaker template of each segment based on a role scene of a role included in each segment;
and the voice synthesis subunit is used for synthesizing and splicing the segment voice of each segment based on the speaker template to obtain the synthesized voice corresponding to the target text.
Based on any of the above embodiments, the template determination subunit is specifically configured to:
clustering character attributes of the same character in the same character scene to obtain at least one character attribute cluster, and determining a speaker template associated with each character attribute cluster;
And determining the speaker template associated with the character attribute cluster of the character attribute of the character contained in each segment as the speaker template of each segment.
Based on any of the above embodiments, the attribute extraction unit 510 is specifically configured to:
and extracting the character attribute of any segment based on the character attribute of the character contained in the segment arranged before any segment in the target text.
Based on any of the above embodiments, the scene determination unit 520 further includes an attribute matching subunit for:
and respectively matching various attributes in the character attributes with various attributes in the preset character attributes, and determining an attribute matching result based on preset weights of the various attributes and matching results of the various attributes.
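As a non-authoritative illustration of this matching subunit, the sketch below matches each attribute type separately against every preset character attribute in a library and combines the per-attribute results with preset weights to decide success or failure; the weights, the success threshold, and the library layout are assumptions of the sketch.

```python
def attribute_match(role_attrs, preset_library, weights, success_threshold=0.5):
    """Return (matched preset attribute, True) on success, or (None, False) on failure."""
    def weighted_score(preset_attrs):
        # Match each attribute type separately, then weight and sum the results.
        return sum(weights[name]
                   for name in weights
                   if name in role_attrs and role_attrs[name] == preset_attrs.get(name))

    best = max(preset_library, key=weighted_score, default=None)
    if best is not None and weighted_score(best) >= success_threshold:
        return best, True    # matching succeeds: reuse the existing character
    return None, False       # matching fails: a new character and character scene are created
```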
Fig. 6 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 6, the electronic device may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a speech synthesis method comprising: extracting character attributes of characters contained in each segment in the target text; matching the character attribute with a preset character attribute, and marking the character attribute according to the attribute matching result to obtain the character scene of the character corresponding to the character attribute, wherein the character scene is the scene to which the character belongs in a segment, and the same character corresponds to at least one scene across a plurality of segments; and synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, and the computer program, when executed by a processor, being capable of performing the speech synthesis method provided above, the method comprising: extracting character attributes of characters contained in each segment in the target text; matching the character attribute with a preset character attribute, and marking the character attribute according to the attribute matching result to obtain the character scene of the character corresponding to the character attribute, wherein the character scene is the scene to which the character belongs in a segment, and the same character corresponds to at least one scene across a plurality of segments; and synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the speech synthesis method provided above, the method comprising: extracting character attributes of characters contained in each segment in the target text; matching the character attribute with a preset character attribute, and marking the character attribute according to the attribute matching result to obtain the character scene of the character corresponding to the character attribute, wherein the character scene is the scene to which the character belongs in a segment, and the same character corresponds to at least one scene across a plurality of segments; and synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on such understanding, the foregoing technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A method of speech synthesis, comprising:
extracting character attributes of characters contained in each segment in the target text;
matching the character attribute with a preset character attribute, and marking the character attribute according to an attribute matching result to obtain a character scene of a character corresponding to the character attribute, wherein the character scene is a scene of the character in the segment, and the same character corresponds to at least one scene in a plurality of segments;
and synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
2. The method of claim 1, wherein the performing role marking on the role attribute based on the attribute matching result to obtain a role scene of a role corresponding to the role attribute includes:
under the condition that the attribute matching result is successful, determining a role scene to which the role attribute belongs based on the matching degree between the role attribute and the preset role attribute which is successful in matching and the role scene to which the preset role attribute which is successful in matching belongs;
And under the condition that the attribute matching result is that the matching fails, creating a role and a role scene based on the role attribute, and taking the role attribute as a preset role attribute of the created role.
3. The method according to claim 2, wherein the determining a character scene to which the character attribute belongs based on a degree of matching between the character attribute and a preset character attribute that is successfully matched and a character scene to which the preset character attribute that is successfully matched belongs includes:
creating a role scene to which the role attribute belongs based on the role attribute under the condition that the matching degree is smaller than a preset threshold value, and taking the role attribute as a preset role attribute in the created role scene;
and under the condition that the matching degree is greater than or equal to the preset threshold value, determining the character scene to which the successfully matched preset character attribute belongs as the character scene to which the character attribute belongs.
4. The speech synthesis method according to claim 3, further comprising, in the case where the degree of matching is greater than or equal to the preset threshold:
and updating the preset role attribute successfully matched based on the role attribute.
5. The method for synthesizing speech according to claim 1, wherein synthesizing and splicing the segment voice of each segment based on the character scene of the character included in each segment to obtain the synthesized voice corresponding to the target text, comprises:
determining a speaker template of each segment based on the role scene of the role contained in each segment;
and synthesizing and splicing the segment voice of each segment based on the speaker template to obtain the synthesized voice corresponding to the target text.
6. The method of claim 5, wherein determining the speaker template for each segment based on the character scene of the character included in each segment comprises:
clustering character attributes of the same character in the same character scene to obtain at least one character attribute cluster, and determining a speaker template associated with each character attribute cluster;
and determining the speaker template associated with the character attribute cluster of the character attribute of the character contained in each segment as the speaker template of each segment.
7. The method according to any one of claims 1 to 6, wherein extracting character attributes of characters included in each segment of the target text includes:
And extracting the character attribute of any segment based on the character attribute of the character contained in the segment arranged in front of any segment in the target text.
8. The method according to any one of claims 1 to 6, wherein said matching the character attribute with a preset character attribute comprises:
and respectively matching various attributes in the character attributes with various attributes in the preset character attributes, and determining the attribute matching result based on preset weights of the various attributes and the matching result of the various attributes.
9. A speech synthesis apparatus, comprising:
the attribute extraction unit is used for extracting the character attribute of the character contained in each segment in the target text;
the scene determining unit is used for matching the character attribute with a preset character attribute, and marking the character attribute according to an attribute matching result to obtain a character scene of a character corresponding to the character attribute, wherein the character scene is a scene of the character in the segment, and the same character corresponds to at least one scene in a plurality of segments;
and the voice synthesis unit is used for synthesizing and splicing the segment voice of each segment based on the role scene of the role contained in each segment to obtain the synthesized voice corresponding to the target text.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech synthesis method according to any of claims 1 to 8 when executing the program.
11. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 8.
CN202311041784.8A 2023-08-16 2023-08-16 Speech synthesis method, device, electronic equipment and storage medium Pending CN116913245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311041784.8A CN116913245A (en) 2023-08-16 2023-08-16 Speech synthesis method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311041784.8A CN116913245A (en) 2023-08-16 2023-08-16 Speech synthesis method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116913245A true CN116913245A (en) 2023-10-20

Family

ID=88365065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311041784.8A Pending CN116913245A (en) 2023-08-16 2023-08-16 Speech synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116913245A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination