CN113539235B - Text analysis and speech synthesis method, device, system and storage medium - Google Patents


Info

Publication number
CN113539235B
CN113539235B (application CN202110787732.XA)
Authority
CN
China
Prior art keywords
character
information
text
role
age
Prior art date
Legal status
Active
Application number
CN202110787732.XA
Other languages
Chinese (zh)
Other versions
CN113539235A
Inventor
潘华山 (Pan Huashan)
李秀林 (Li Xiulin)
Current Assignee
Beibei Qingdao Technology Co ltd
Original Assignee
Beibei Qingdao Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beibei Qingdao Technology Co ltd
Priority to CN202110787732.XA
Publication of CN113539235A
Application granted
Publication of CN113539235B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Abstract

The invention provides a text analysis and speech synthesis method, device, system, and storage medium. The method comprises the following steps: acquiring a text to be processed; performing name recognition on the text to be processed to determine all person names appearing in it; clustering the names belonging to the same role to obtain at least one name set in one-to-one correspondence with at least one role; determining global role information based at least on the at least one name set, the global role information comprising at least one group of role information in one-to-one correspondence with the at least one role, each group of role information comprising the representative role name and alias set of the corresponding role; and performing text analysis on any target sentence in the text to be processed in combination with the global role information, the text analysis comprising analysis of at least one preset item among the following: text type, role name, and role attribute. The global role information is used to assist in identifying local role information.

Description

Text analysis and speech synthesis method, device, system and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a text analysis method, apparatus, system, and storage medium, and a speech synthesis method, apparatus, system, and storage medium.
Background
Speech synthesis is a technology for converting text information into sound information. It can provide speech synthesis services to a wide range of users and target applications, and speech synthesis systems are now in widespread use. With users' growing demand for audiobook resources, manually recording audiobook corpora can no longer keep up, so developing multi-role, multi-emotion (semi-)automatic speech synthesis technologies/tools/systems is important.
In a multi-role, multi-emotion speech synthesis system, text analysis must first be performed on the text to be synthesized to obtain information such as the role name, role attributes, and emotion category corresponding to each dialogue (and monologue) sentence in the text. A speech synthesis model matching each character is then selected based on the analyzed character information, and speech is synthesized with that model.
In the prior art, the text analysis described above is typically a local analysis based on the target sentence and its context, which can only obtain local character information. However, information such as a character's name, gender, and age is usually stable over a wide (or global) range, and it is difficult for such local analysis to associate character information across that range, which hinders the identification and association of the same character. In addition, local information is often insufficient by itself to identify some attribute information of a character, such as gender.
Disclosure of Invention
In order to at least partially solve the problems in the prior art, there are provided a text analysis method, device, system, and storage medium, and a speech synthesis method, device, system, and storage medium.
According to one aspect of the present invention, there is provided a text analysis method including: acquiring a text to be processed; performing name recognition on the text to be processed to determine all person names appearing in it; clustering the names belonging to the same role to obtain at least one name set in one-to-one correspondence with at least one role; determining global role information based at least on the at least one name set, wherein the global role information includes at least one group of role information in one-to-one correspondence with the at least one role, each group of role information includes the representative role name and an alias set of the corresponding role, and the alias set includes the names in the corresponding role's name set other than the representative role name; and performing text analysis on any target sentence in the text to be processed in combination with the global role information to obtain a text analysis result corresponding to the target sentence, wherein the text analysis includes analysis of at least one preset item among the following: text type, role name, and role attribute, where analysis of the text type judges whether the target sentence belongs to a multi-role type, the multi-role type includes dialogue, and the role attribute includes role gender and/or role age.
According to another aspect of the present invention, there is also provided a text analysis system comprising a processor and a memory, wherein the memory stores computer program instructions for executing the text analysis method described above when the computer program instructions are executed by the processor.
According to another aspect of the present invention, there is also provided a storage medium on which program instructions are stored, the program instructions being operable, when executed, to perform the above text analysis method.
According to another aspect of the present invention, there is also provided a speech synthesis method including the above text analysis method, wherein the speech synthesis method further includes: and performing voice synthesis on the target sentence at least based on the text analysis result so as to obtain a synthesized voice corresponding to the target sentence.
According to another aspect of the present invention, there is also provided a speech synthesis system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the above-mentioned speech synthesis method.
According to another aspect of the present invention, there is also provided a storage medium on which program instructions are stored, which program instructions are used, when run, to perform the above-described speech synthesis method.
According to another aspect of the present invention, there is also provided a speech synthesis method including the above text analysis method, wherein the speech synthesis method further includes: outputting text result information, the text result information including an initial text analysis result obtained by the step of performing text analysis on any target sentence in the text to be processed in combination with the global role information; receiving text feedback information input by a user; in the case that the text feedback information includes first modification information related to the initial text analysis result, modifying the initial text analysis result based on the first modification information to obtain a new text analysis result; and performing speech synthesis on the target sentence based at least on a final text analysis result to obtain a final synthesized speech corresponding to the target sentence, wherein the final text analysis result is the initial text analysis result if the initial text analysis result was not modified, and is the new text analysis result if it was.
According to another aspect of the present invention, there is also provided a speech synthesis system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the above-mentioned speech synthesis method.
According to another aspect of the present invention, there is also provided a storage medium on which program instructions are stored, which program instructions are used, when run, to perform the above-described speech synthesis method.
According to another aspect of the present invention, there is also provided a text analysis apparatus including: an acquisition module for acquiring a text to be processed; a name recognition module for performing name recognition on the text to be processed to determine all person names appearing in it; a clustering module for clustering the names belonging to the same role to obtain at least one name set in one-to-one correspondence with at least one role; a global determination module for determining global role information based at least on the at least one name set, wherein the global role information includes at least one group of role information in one-to-one correspondence with the at least one role, each group of role information includes the representative role name and an alias set of the corresponding role, and the alias set includes the names in the corresponding role's name set other than the representative role name; and a text analysis module for performing text analysis on any target sentence in the text to be processed in combination with the global role information to obtain a text analysis result corresponding to the target sentence, wherein the text analysis includes analysis of at least one preset item among the following: text type, role name, and role attribute, where analysis of the text type judges whether the target sentence belongs to a multi-role type, the multi-role type includes dialogue, and the role attribute includes role gender and/or role age.
According to another aspect of the present invention, there is also provided a speech synthesis apparatus including the above text analysis apparatus, wherein the speech synthesis apparatus further includes: and the voice synthesis module is used for carrying out voice synthesis on the target sentence at least based on the text analysis result so as to obtain the synthesized voice corresponding to the target sentence.
According to another aspect of the present invention, there is also provided a speech synthesis apparatus including the above text analysis apparatus, wherein the speech synthesis apparatus further includes: an output module for outputting text result information, the text result information including an initial text analysis result obtained by the step of performing text analysis on any target sentence in the text to be processed in combination with the global role information; a receiving module for receiving text feedback information input by a user; a modification module for modifying the initial text analysis result based on first modification information to obtain a new text analysis result when the text feedback information includes the first modification information related to the initial text analysis result; and a speech synthesis module for performing speech synthesis on the target sentence based at least on a final text analysis result to obtain a final synthesized speech corresponding to the target sentence, wherein the final text analysis result is the initial text analysis result if the initial text analysis result was not modified, and is the new text analysis result if it was.
According to the text analysis and speech synthesis method, device, system, and storage medium of the present invention, role information can be acquired over a global scope (for example, an entire article), and when text analysis is later performed on an individual target sentence, the global role information can be used to assist in identifying each item of role information of that sentence, for example, to correct misidentified information such as gender and age. The accuracy and efficiency of text analysis can thus be greatly improved. Where text analysis is applied in the field of speech synthesis, this scheme helps improve the accuracy of subsequent speech synthesis, so that the user experience of a speech synthesis system can be greatly improved.
This summary introduces a selection of concepts in a simplified form that are further described in the detailed description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Advantages and features of the invention are described in detail below with reference to the accompanying drawings.
Drawings
The following drawings are included to provide an understanding of the invention and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with their description, serve to explain the principles of the invention. In the drawings:
FIG. 1 shows a schematic flow chart of a text analysis method according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a text analysis method according to one embodiment of the invention;
FIG. 3 shows a schematic flow chart of a speech synthesis method according to one embodiment of the invention;
FIG. 4 shows a schematic diagram of a process flow of a multi-role, multi-emotion speech synthesis method in accordance with one embodiment of the present invention;
FIG. 5 illustrates a schematic diagram of text analysis in conjunction with global role information during speech synthesis in accordance with one embodiment of the present invention;
FIG. 6 shows a schematic block diagram of a text analysis device according to one embodiment of the invention; and
FIG. 7 shows a schematic block diagram of a text analysis system in accordance with one embodiment of the invention.
Detailed Description
In the following description, numerous details are provided to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the following description illustrates preferred embodiments of the invention by way of example only and that the invention may be practiced without one or more of these details. Furthermore, some technical features that are known in the art have not been described in detail in order to avoid obscuring the invention.
In order to at least partially solve the above technical problems, an embodiment of the present invention provides a text analysis method. With this method, role information can be acquired over a global scope (such as an entire article), and when text analysis is later performed on a single target sentence, the global role information can be used to assist in identifying each item of role information of that sentence, for example, to correct misidentified information such as gender and age. The accuracy and efficiency of text analysis can thus be greatly improved. Where text analysis is applied in the field of speech synthesis, this scheme helps improve the accuracy of subsequent speech synthesis, so that the user experience of a speech synthesis system can be greatly improved.
According to one aspect of the invention, a text analysis method is disclosed. Fig. 1 shows a schematic flow chart of a text analysis method 100 according to one embodiment of the invention. As shown in FIG. 1, the text analysis method 100 includes steps S110-S150.
In step S110, a text to be processed is acquired.
The text to be processed may be text of any length including, but not limited to: an entire article (e.g., a whole novel), a section in an article, a paragraph in an article, or a sentence in an article, etc.
In step S120, name recognition is performed on the text to be processed to determine all the names of persons appearing in the text to be processed.
All nouns representing person names (together with personal pronouns, etc.) appearing in the text to be processed can be automatically identified by a Named Entity Recognition (NER) model. Optionally, after all names have been identified, the features of each name may be analyzed, such as its number of occurrences, whether it contains a surname, whether it contains common given-name characters, and so forth.
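The patent does not fix a particular NER implementation. As a minimal illustration, the sketch below stands in for a trained model with a toy surname-lexicon rule (the `SURNAMES` set and the matching rule are assumptions) and also records each name's number of occurrences, one of the features mentioned above:

```python
import re
from collections import Counter

# Toy surname lexicon; a production system would use a trained NER model.
SURNAMES = {"Wang", "Li", "Zhang"}

def recognize_names(text):
    """Return every candidate person name with its occurrence count."""
    # Hypothetical rule: a known surname, optionally followed by another
    # capitalized token (the given name), counts as one name mention.
    tokens = re.findall(r"[A-Z][a-z]+", text)
    names = Counter()
    i = 0
    while i < len(tokens):
        if tokens[i] in SURNAMES:
            if i + 1 < len(tokens) and tokens[i + 1] not in SURNAMES:
                names[tokens[i] + " " + tokens[i + 1]] += 1
                i += 2
                continue
            names[tokens[i]] += 1
        i += 1
    return names

counts = recognize_names("Wang Xiaoming smiled. Wang Xiaoming left. Li waved.")
```

The returned counts can feed directly into the name-feature analysis described above.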
In step S130, names belonging to the same character among all the names are clustered together to obtain at least one set of names corresponding to at least one character one by one.
Name clustering may also be referred to as name disambiguation. Illustratively, all the person names identified in step S120 may be clustered by a clustering (or disambiguation) model. For example, if "Wang Xiaoming", "Mr. Wang", "Wang", and "manager" are all names belonging to the role "Wang Xiaoming", these names can be clustered together to form one name set. Each name set formed by the clustering corresponds to one role (i.e., one category). Alternatively, all roles obtained by clustering (i.e., all roles appearing in the text to be processed) may be taken as the at least one role. Optionally, the roles obtained by clustering may be further screened, e.g., only roles whose number of occurrences exceeds a certain threshold are selected as the at least one role.
The step S130 may be implemented by any existing or future name clustering (or disambiguation) method, which will not be described herein.
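Since step S130 leaves the clustering method open, the following sketch uses a crude shared-word heuristic as a stand-in for a real disambiguation model; a real model would also need context or coreference information to link aliases such as "manager" that share no words with the person's name:

```python
def cluster_names(names):
    """Group aliases that share a common component (a crude disambiguation)."""
    clusters = []
    for name in names:
        placed = False
        for cluster in clusters:
            # Hypothetical rule: any shared word links two mentions of one role.
            if any(set(name.split()) & set(member.split()) for member in cluster):
                cluster.add(name)
                placed = True
                break
        if not placed:
            clusters.append({name})
    return clusters

clusters = cluster_names(["Wang Xiaoming", "Mr. Wang", "Wang", "Li Hua"])
```

Each resulting set plays the part of one role's name set in the steps that follow.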
In step S140, global role information is determined based at least on the at least one person name set, wherein the global role information includes at least one set of role information in one-to-one correspondence with the at least one role, each set of role information includes a representative role name of the corresponding role and an alias set including person names other than the representative role name in the person name set of the corresponding role.
In addition to the at least one name set, global role information may be determined in combination with other information. For example, global role information may be further determined in combination with text segments containing each name in the at least one name set and/or specific sentences containing each such name. In this context, a text segment containing a given person name may consist of only the characters of the name itself, or of the name's characters plus preceding and/or following characters, where the preceding characters are a first preset number of characters before the name and the following characters are a second preset number of characters after it. The first and second preset numbers may be set as needed, may take any value, and may be equal or unequal.
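Extracting such text segments can be sketched as simple slicing, with the first and second preset numbers as parameters (`before` and `after` are assumed names):

```python
def context_snippets(text, name, before=10, after=10):
    """Collect each occurrence of `name` with `before`/`after` surrounding characters."""
    snippets = []
    start = text.find(name)
    while start != -1:
        lo = max(0, start - before)
        hi = min(len(text), start + len(name) + after)
        snippets.append(text[lo:hi])
        start = text.find(name, start + 1)
    return snippets

snips = context_snippets("Then Wang Xiaoming spoke.", "Wang Xiaoming", before=5, after=7)
```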
Determining global role information based at least on the at least one set of personal names may include: for each character in the at least one character, a representative character name is selected from a set of person names corresponding to the character.
For any character, a person name may be selected from its set of person names as its representative character name, and the remaining person names in the set of person names may be used as aliases.
In one example, for any character, a person name may be randomly selected from its set of person names as its representative character name. In another example, a representative role name may be selected from a set of person names according to a preset criteria.
Illustratively, for each of the at least one role, selecting a representative role name from the name set corresponding to the role includes: for each role, analyzing the name features of each name in the role's name set, the name features including one or more of the following: the number of occurrences, whether it contains a surname, whether it contains common given-name characters; and selecting, from the role's name set, a name whose features meet a preset requirement as the representative role name of the role.
The preset requirement can be set as needed. Illustratively, the preset requirement includes: the number of occurrences is the greatest in the name set corresponding to the role, or the number of occurrences is the greatest among names containing a surname and/or common given-name characters.
In one example, the representative role name may be selected based only on the number of occurrences of the person name. For example, for any character, the name with the largest number of occurrences may be selected from its corresponding set of names as the representative character name. The scheme is simple to realize and has high selection speed.
In another example, for any role, the names containing a surname and/or common given-name characters may be found in its name set. If there is only one such name, that name is taken as the representative role name. If there are multiple such names, one of them can be chosen, for example at random, as the representative role name.
In yet another example, for any role, the names containing a surname and/or common given-name characters may first be found in its name set, and the name with the highest number of occurrences may then be selected from among them as the representative role name. This scheme selects as the representative role name a name that both contains a surname and/or common given-name characters and occurs frequently. Names containing a surname and/or common given-name characters have a relatively low probability of being misrecognized and facilitate subsequent text or speech processing, so such names may be preferred as representative role names. In addition, compared with a rarely occurring name, a frequently occurring name accounts for a relatively large proportion of the whole text to be processed, and using such a name as the representative role name facilitates selection of a more suitable speech synthesis model in subsequent speech synthesis.
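The selection rule in this last example can be sketched as follows, preferring the most frequent name among those containing a surname and/or common given-name element, with a fallback to the most frequent name overall (the lexicons passed in are assumptions):

```python
def pick_representative(name_counts, surnames, common_given_chars):
    """Pick a representative role name: prefer the most frequent name that
    contains a surname or a common given-name element; otherwise fall back
    to the most frequent name overall."""
    def has_cue(name):
        return (any(s in name for s in surnames)
                or any(c in name for c in common_given_chars))
    cued = [n for n in name_counts if has_cue(n)]
    pool = cued if cued else list(name_counts)
    return max(pool, key=lambda n: name_counts[n])

rep = pick_representative(
    {"Wang Xiaoming": 30, "Mr. Wang": 12, "manager": 50},
    surnames={"Wang"},
    common_given_chars={"ming"},
)
```

Note that "manager", although the most frequent mention, is passed over because it carries no surname or given-name cue.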
It will be appreciated that the above-described scheme of selecting representative character names is merely an example, and that the person name features may also include other types of features, and that the preset requirements may also be other forms of requirements.
Each group of role information may include the representative role name and alias set of the corresponding role. With the global representative role names and alias sets, the role name corresponding to a target sentence can be identified more reliably when local text analysis of that sentence is performed later.
In one example, each set of role information may include only a representative role name and alias set for the corresponding role. In another example, each set of character information may also include other information, such as character attribute information, and the like. The character attribute information may include character gender information and/or character age information, etc.
In one example, global role information may be used as input to a role recognition algorithm (e.g., a role recognition model) when analyzing the role name corresponding to a target sentence, so that the role name is recognized directly by the algorithm. In another example, after the role name corresponding to the target sentence has been recognized by the role recognition algorithm, the recognized name may be corrected with reference to the global role information. For example, if statistics show that the role "Wang Xiaoming" has four names, "Wang Xiaoming", "Mr. Wang", "Wang", and "manager", and the role name corresponding to the current target sentence is "manager", the role name of the target sentence can be corrected from "manager" (an alias) to "Wang Xiaoming" (the representative role name).
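The correction step in the second example amounts to a lookup against the global role table; a minimal sketch (the table contents are illustrative):

```python
# Hypothetical global role table: representative role name -> alias set.
GLOBAL_ROLES = {
    "Wang Xiaoming": {"Mr. Wang", "Wang", "manager"},
}

def canonicalize(role_name):
    """Map a locally recognized alias back to its representative role name."""
    for representative, aliases in GLOBAL_ROLES.items():
        if role_name == representative or role_name in aliases:
            return representative
    return role_name  # unknown name: leave unchanged

result = canonicalize("manager")
```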
In summary, through the representative role name and alias set of each role in the global role information, the identification of the role name for the target sentence can be facilitated.
In step S150, text analysis is performed on any target sentence in the text to be processed in combination with the global role information to obtain a text analysis result corresponding to the target sentence, where the text analysis includes analysis of at least one preset item among the following: text type, role name, and role attribute, where analysis of the text type judges whether the target sentence belongs to a multi-role type, the multi-role type includes dialogue, and the role attribute includes role gender and/or role age.
For example, text types may be classified into two types: the multi-role type and the non-multi-role type. In one example, the multi-role type may include only dialogue. In another example, the multi-role type may include both dialogue and monologue. The non-multi-role type may include, for example, narration. The text type of the target sentence may be analyzed first. If the target sentence is of the multi-role type, its role name and/or role attributes can be analyzed further, and even its emotion category. If the target sentence is not of the multi-role type, the other items need not be analyzed.
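This gating of the deeper analyses on the text type can be sketched as below; the three analyzers are passed in as callables, since the patent does not fix their implementations (the toy classifiers in the example call are placeholders):

```python
def analyze_sentence(sentence, classify_type, identify_role, infer_attributes):
    """Analyze text type first; only multi-role sentences get deeper analysis."""
    text_type = classify_type(sentence)
    result = {"text_type": text_type}
    if text_type == "multi-role":          # dialogue (and possibly monologue)
        result["role_name"] = identify_role(sentence)
        result["attributes"] = infer_attributes(result["role_name"])
    return result

out = analyze_sentence(
    '"Hello," said Wang Xiaoming.',
    classify_type=lambda s: "multi-role" if '"' in s else "narration",
    identify_role=lambda s: "Wang Xiaoming",
    infer_attributes=lambda r: {"gender": "male"},
)
```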
Embodiments of identifying the role name corresponding to the target sentence by combining the representative role name and alias set of each role in the global role information have been described above and are not repeated here.
Text analysis of the target sentence may also include analysis for character attributes. As described above, the character name corresponding to the target sentence can be identified in combination with the global character information. Specifically, the role name of the target sentence is identified, so that the role corresponding to the target sentence and the name set corresponding to the role can be determined. In one example, the role attributes of the roles corresponding to the target sentence may be analyzed in conjunction with the set of person names corresponding to the target sentence. In another example, a set of role information corresponding to any role in the global role information may itself include role attribute information. Thus, after the character name corresponding to the target sentence is identified, the character attribute information corresponding to the target sentence can be determined.
In one example, the target sentence may be first subjected to local text analysis in a conventional manner to obtain its initial text analysis result, and then further corrected using global role information. In another example, the analysis of items such as role names, role attributes, etc. may be performed on the target statement directly in conjunction with the global role information.
According to the text analysis method provided by the embodiment of the invention, global role analysis can be performed on the whole text to be processed, and global role information can be obtained. When text analysis is carried out on individual target sentences later, the global role information can be utilized to assist in identifying each item of role information of the target sentences. Therefore, the method is beneficial to correlating a large range of character information, and facilitates the identification and correlation of the same character, so that the accuracy and efficiency of text analysis can be greatly improved.
According to an embodiment of the present invention, each set of character information in the at least one set of character information further includes character attribute information of a corresponding character, and the character attribute information includes character gender information and/or character age information.
It is understood that character gender information is information related to the character's gender, that is, information indicating whether the character is male or female. Character age information is information related to the character's age, that is, information indicating what the character's age is. Optionally, the character age information may be a single age value or an age range. For example, if the text to be processed describes the character "Xiaoli" as being 16 years old, her character age information may be 16. For another example, if the text to be processed describes the character "Xiaoli" as a senior high school student, her character age information can be presumed to be the interval of 16 to 18 years. Optionally, the character age information may also be an age-related descriptive word, for example "senior high school student", "college student", "young person", "elderly person", and the like.
According to an embodiment of the present invention, the character attribute information includes character gender information, and determining the global character information based at least on the at least one person-name set includes: for any specific character of the at least one character, determining the gender of the specific character through the gender classification result obtained by each of one or more of a first operation, a second operation, and a third operation, so as to obtain the character gender information of the specific character. The first operation includes: searching the person-name set corresponding to the specific character for gender-indicating person names capable of indicating gender; if at least one gender-indicating person name is found, analyzing the gender of the specific character based on the gender indicated by the at least one gender-indicating person name to obtain a first gender classification result. The second operation includes: searching the text to be processed for one or more personal-pronoun sets that match, in one-to-one correspondence, one or more person names in the person-name set corresponding to the specific character, wherein each personal-pronoun set includes one or more personal pronouns; if at least one personal-pronoun set is found, analyzing the gender of the specific character based on the gender indicated by each personal pronoun in the at least one personal-pronoun set to obtain a second gender classification result. The third operation includes: for each person name in the person-name set corresponding to the specific character, inputting the text segment containing the person name, the specific sentence containing the person name, and the context of the specific sentence into a gender classification model to obtain a third gender classification result output by the gender classification model, wherein the third gender classification result indicates the gender of the specific character.
For example, the specific character may be each of at least one character, that is, for each of at least one character, the sex thereof may be determined using the scheme of the present embodiment.
The sex of the specific character may be determined by a sex classification result obtained by each of one or more of the first operation, the second operation, and the third operation. If the sex of the specific character is determined only through a single operation among the first, second and third operations, the sex determined by the operation may be directly regarded as the sex of the specific character to obtain final character sex information. If the sex of the specific character is determined through a plurality of operations among the first, second and third operations, the sex classification results of the plurality of operations may be further processed, for example, weighted average, to obtain a new sex classification result, and the sex of the specific character is determined based on the new sex classification result to obtain final character sex information.
In the first operation, for any character, the representative character name and alias set may be checked for a gender-indicating person name, that is, a person name capable of indicating gender. Gender-indicating person names include, for example, titles and kinship terms such as Mr./Miss/Madam/Mrs./girl/boy/grandma/grandpa/dad/mom/imperial concubine/queen/empress dowager/prince/princess, etc. If such a person name is present, the character's gender may be determined from it.
Illustratively, character gender may be divided into at least two gender categories. The first gender classification result may be, or may include, gender ratio information in one-to-one correspondence with each gender (the first gender ratio P_i1 described below). Optionally, the first gender classification result may instead be, or include, gender information identifying a unique gender. The at least two genders may include the two categories male and female, or the three categories male, female, and unknown. Other ways of dividing character gender are also possible and are not detailed here. For example, if the person-name set of a certain character includes 5 gender-indicating person names, of which 3 indicate male and 2 indicate female, the first gender classification result may be intermediate information such as "male ratio 3/5, female ratio 2/5", or conclusive information such as "male".
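The first-operation tally described above can be sketched in Python. This is a minimal sketch under stated assumptions: the marker keyword lists and the function name `first_gender_ratios` are illustrative inventions, not part of the embodiment.

```python
# Sketch (assumed helper, not from the source) of the first operation:
# estimate a character's gender from gender-indicating person names in
# its person-name set. The marker keyword lists are illustrative only.
FEMALE_MARKERS = ["miss", "lady", "girl", "mom", "queen", "princess"]
MALE_MARKERS = ["mr", "boy", "dad", "grandpa", "prince"]

def first_gender_ratios(name_set):
    """Return the first gender ratios {gender: P_i1}, or None if the
    name set contains no gender-indicating person name."""
    counts = {"male": 0, "female": 0}
    for name in name_set:
        lowered = name.lower()
        # Check female markers first so "princess" is not caught by "prince".
        if any(m in lowered for m in FEMALE_MARKERS):
            counts["female"] += 1
        elif any(m in lowered for m in MALE_MARKERS):
            counts["male"] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return {g: n / total for g, n in counts.items()}

# The text's example: 5 gender-indicating names, 3 male and 2 female,
# gives ratios of 3/5 and 2/5.
ratios = first_gender_ratios(
    ["Mr. Zhang", "Zhang boy", "Grandpa Zhang", "Miss Li", "Li girl"])
```

Returning `None` when no marker matches mirrors the "first condition not satisfied" branch described later, where the method falls through to the second operation.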
In the second operation, for any character, personal pronouns matching one or more person names in the character's person-name set can be identified from the text to be processed through a preset rule, and the character's gender determined from these pronouns. Personal pronouns include, for example: he/she/it/the gentleman/the lady, etc. The preset rule may be, for example, the proximity rule described later.
Note that each person name may appear one or more times in the text to be processed, and a matching personal pronoun may or may not be identified at each occurrence of the name. Thus, if a person name appears n1 times and a matching personal pronoun is found n2 times, then n2 ≤ n1, n1 ≥ 1, and n2 ≥ 0. For example, suppose the person-name set of character A includes 2 person names in total: the 1st person name appears 10 times and 8 matching personal pronouns are found (the personal-pronoun set matching the 1st name), while the 2nd person name appears 20 times and 15 matching personal pronouns are found (the personal-pronoun set matching the 2nd name); then character A has 23 personal pronouns in total.
Similar to the first gender classification result, the second gender classification result may be, or may include, gender ratio information in one-to-one correspondence with each gender (the second gender ratio P_i2 described below). Optionally, the second gender classification result may instead be, or include, gender information identifying a unique gender. Continuing the above example, if character A has 23 personal pronouns in total, of which 19 indicate male and 4 indicate female, the second gender classification result may be intermediate information such as "male ratio 19/23, female ratio 4/23", or conclusive information such as "male".
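The second-operation tally can be sketched the same way. The pronoun-to-gender mapping and the function name below are illustrative assumptions; the input is one list of matched pronouns per person name, as in the example above.

```python
# Sketch (assumed helper) of the second-operation tally: given the
# personal-pronoun sets matched to a character's person names, compute
# the second gender ratios P_i2. The pronoun mapping is illustrative.
PRONOUN_GENDER = {"he": "male", "him": "male", "she": "female", "her": "female"}

def second_gender_ratios(pronoun_sets):
    """pronoun_sets: one list of matched pronouns per person name."""
    counts = {"male": 0, "female": 0}
    for pronouns in pronoun_sets:
        for p in pronouns:
            gender = PRONOUN_GENDER.get(p.lower())
            if gender is not None:
                counts[gender] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return {g: n / total for g, n in counts.items()}

# The text's example: 8 pronouns matched to the 1st name and 15 to the
# 2nd, 19 male and 4 female in total, gives ratios 19/23 and 4/23.
ratios = second_gender_ratios([["he"] * 8, ["he"] * 11 + ["she"] * 4])
```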
In the third operation, the character's gender may be determined by a gender classification model. In one example, for each person name, the text segment containing the name, the specific sentence containing the name, and the context of that sentence may be input into the gender classification model. It will be appreciated that a person name may appear one or more times; in the case of multiple occurrences, there are multiple copies of the text segment containing the name, the specific sentence containing the name, and the context of that sentence. Optionally, the first gender classification result and/or the second gender classification result may additionally be input into the gender classification model. The gender classification model may be any suitable neural network model. For example, it may be implemented using one or more of a convolutional neural network (CNN), a recurrent neural network (RNN), a Transformer model, and the like.
Illustratively, the third gender classification result includes gender probabilities in one-to-one correspondence with the at least two genders; each probability represents the probability that the specific character's gender is the corresponding gender. The gender classification model may directly output the probability of each gender. Of course, the model may instead output gender information identifying a unique gender.
According to an embodiment of the present invention, character gender is divided into at least two genders. Analyzing the gender of the specific character based on the gender indicated by the at least one gender-indicating person name to obtain the first gender classification result includes: determining the gender indicated by each of the at least one gender-indicating person name; and, for each of the at least two genders, calculating the ratio of the number of gender-indicating person names belonging to that gender to the total number of the at least one gender-indicating person name, to obtain the first gender ratio P_i1 corresponding to that gender, the first gender classification result including the first gender ratios P_i1 in one-to-one correspondence with the at least two genders, where i = 1, 2, 3, ..., N, and N is the total number of genders in the at least two genders. And/or, analyzing the gender of the specific character based on the gender indicated by each personal pronoun in the at least one personal-pronoun set to obtain the second gender classification result includes: determining the gender indicated by each personal pronoun in the at least one personal-pronoun set; and, for each of the at least two genders, calculating the ratio of the number of personal pronouns belonging to that gender to the total number of all personal pronouns in the at least one personal-pronoun set, to obtain the second gender ratio P_i2 corresponding to that gender, the second gender classification result including the second gender ratios P_i2 in one-to-one correspondence with the at least two genders. And/or, the third gender classification result includes gender probabilities P_i3 in one-to-one correspondence with the at least two genders, the gender probability P_i3 representing the probability that the specific character's gender is the corresponding gender.
The calculation of the first gender ratio P_i1, the second gender ratio P_i2, and the gender probability P_i3 has been described above with examples and is not repeated here.
The step of calculating, for each of the at least two genders, the ratio of the number of gender-indicating person names belonging to that gender to the total number of the at least one gender-indicating person name, to obtain the first gender ratio P_i1 corresponding to that gender, may be performed when the genders indicated by the at least one gender-indicating person name are not uniform. If the genders indicated by the at least one gender-indicating person name are uniform, for example all indicate female, that gender can directly be taken as the gender of the specific character to obtain the first gender classification result.
Similarly, the step of calculating, for each of the at least two genders, the ratio of the number of personal pronouns belonging to that gender to the total number of all personal pronouns in the at least one personal-pronoun set, to obtain the second gender ratio P_i2 corresponding to that gender, may be performed when the genders indicated by all the personal pronouns in the at least one personal-pronoun set are not uniform. If the genders indicated by all those personal pronouns are uniform, for example all indicate female, that gender may directly be taken as the gender of the specific character to obtain the second gender classification result.
Compared with gender information identifying a unique gender, the first gender ratio P_i1, the second gender ratio P_i2, and the gender probability P_i3 contain richer information and facilitate subsequent processing, for example combining multiple gender classification results to obtain more accurate gender information.
According to an embodiment of the present invention, determining the gender of the specific character through the gender classification result obtained by each of one or more of the first, second, and third operations to obtain the character gender information of the specific character includes: determining the gender with the largest first gender ratio P_i1 in the first gender classification result as the gender of the specific character, so as to obtain the character gender information; or determining the gender with the largest second gender ratio P_i2 in the second gender classification result as the gender of the specific character; or determining the gender with the largest gender probability P_i3 in the third gender classification result as the gender of the specific character; or, for each of the at least two genders, performing a weighted average or direct addition of two or three of the first gender ratio P_i1, the second gender ratio P_i2, and the gender probability P_i3 corresponding to that gender to obtain a total gender ratio, and determining the gender with the largest total gender ratio as the gender of the specific character, so as to obtain the character gender information.
When the character's gender is determined from a single operation, the gender with the largest ratio or probability may be selected directly as the gender of the specific character. When it is determined from multiple operations, the ratios or probabilities calculated by those operations for the same gender may be weighted-averaged or directly added to obtain a total ratio per gender, from which the character's gender is determined.
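The multi-operation combination just described can be sketched as follows. The function name and the equal weights are illustrative assumptions; the embodiment leaves the weighting open.

```python
# Sketch (assumed helper) of combining the gender classification
# results of several operations by weighted average; the weights and
# function name are illustrative assumptions.
def combine_gender_results(results, weights):
    """results: per-operation {gender: ratio-or-probability} dicts
    (None for operations that produced nothing); returns the gender
    with the largest combined score."""
    combined = {}
    for res, weight in zip(results, weights):
        if res is None:
            continue
        for gender, p in res.items():
            combined[gender] = combined.get(gender, 0.0) + weight * p
    return max(combined, key=combined.get) if combined else None

r1 = {"male": 3 / 5, "female": 2 / 5}     # first gender ratios P_i1
r2 = {"male": 19 / 23, "female": 4 / 23}  # second gender ratios P_i2
r3 = {"male": 0.7, "female": 0.3}         # gender probabilities P_i3
gender = combine_gender_results([r1, r2, r3], [1.0, 1.0, 1.0])
```

Since only the arg-max is taken, dividing by the total weight is unnecessary; direct addition and weighted averaging select the same gender.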
According to an embodiment of the present invention, determining the gender of the specific character through the gender classification result obtained by each of one or more of the first, second, and third operations to obtain the character gender information of the specific character includes: first performing the first operation, and determining the character gender information based on the first gender classification result if a first condition is satisfied, the first condition including: at least one gender-indicating person name is found; if the first condition is not satisfied, performing the second operation, and determining the character gender information based on the second gender classification result if a second condition is satisfied, the second condition including: at least one personal-pronoun set is found; and if the second condition is not satisfied, performing the third operation and determining the character gender information based on the third gender classification result.
Illustratively, determining the character gender information based on the first gender classification result may include: determining the gender with the largest first gender ratio P_i1 as the gender of the target character. Determining the character gender information based on the second gender classification result may include: determining the gender with the largest second gender ratio P_i2 as the gender of the target character. Determining the character gender information based on the third gender classification result may include: determining the gender with the largest gender probability P_i3 as the gender of the target character.
In embodiments where the character's gender is determined through several of the first, second, and third operations, these operations may be performed sequentially. For example, for any character, the gender may first be determined in the manner of the first operation; if no gender-indicating person name exists, the gender may then be determined in the manner of the second operation; and if no personal pronoun exists either, the gender may finally be determined in the manner of the third operation. The first operation only needs to check the character's person-name set, involves little computation, and can identify the character's gender most quickly, so it may be applied first. The second operation must check the text outside the person-name set for personal pronouns and is more computationally expensive than the first, so it may be applied as a supplement when the first operation cannot determine the gender. The third operation involves a gender classification model, which typically requires training and has a large number of parameters and a large computational cost. Therefore, the gender classification model may be employed only when the first two operations cannot determine the character's gender.
According to an embodiment of the present invention, the first condition further includes: in the first gender classification result, the largest first gender ratio P_i1 is greater than a first ratio threshold; and/or the second condition further includes: in the second gender classification result, the largest second gender ratio P_i2 is greater than a second ratio threshold.
The first and second ratio thresholds may be any suitable values, set as needed; the invention is not limited in this regard. For example, the first and second ratio thresholds may be the same or different.
When deciding whether to continue with the second or third operation, it may further be considered whether the accuracy of the gender classification result obtained by the previous operation meets the requirement. For example, if in the first gender classification result the largest first gender ratio P_i1 is not greater than the first ratio threshold, this may mean that the gap between the gender with the largest first gender ratio and the other genders is too small. For instance, if the first gender ratio P_i1 is 16/30 for male and 14/30 for female, the difference between the two is small, so the reliability of this gender classification result is low, and the second operation may be chosen to continue. An embodiment of the second condition can be understood by analogy with this example and is not detailed further.
Based on this scheme, whether to execute a subsequent operation can be decided from the reliability of the previous operation's gender classification result: if the reliability is high, the current result can be adopted as the character's gender, reducing workload; if the reliability is low, subsequent operations can be applied to further improve gender identification accuracy. The scheme thus balances workload against gender identification accuracy as far as possible.
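The cascaded first-to-third flow with its reliability thresholds can be sketched as control flow. The operation callables, the function name, and the default thresholds of 0.6 are hypothetical placeholders for the schemes described above.

```python
# Sketch (assumed control flow) of cascading the three operations with
# the reliability thresholds of this embodiment. The operation
# callables and default thresholds are hypothetical placeholders.
def classify_gender(role, text, first_op, second_op, third_op,
                    first_threshold=0.6, second_threshold=0.6):
    result = first_op(role)  # cheap: inspects only the person-name set
    if result and max(result.values()) > first_threshold:
        return max(result, key=result.get)
    result = second_op(role, text)  # scans the text for pronouns
    if result and max(result.values()) > second_threshold:
        return max(result, key=result.get)
    result = third_op(role, text)  # neural model: most expensive
    return max(result, key=result.get)
```

A result whose largest ratio does not clear the threshold (e.g. 16/30 male vs 14/30 female) falls through to the next, more expensive operation, matching the workload/accuracy trade-off described above.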
According to an embodiment of the present invention, searching the text to be processed for one or more personal-pronoun sets matching, in one-to-one correspondence, one or more person names in the person-name set corresponding to the specific character includes: for each person name in the person-name set corresponding to the specific character, acquiring at least one specific sentence containing the person name and the context of each such sentence; and, for each of the at least one specific sentence, finding from that sentence and its context the personal pronoun closest to the person name as a personal pronoun matching the person name.
As described above, the preset rule may be, for example, the proximity rule: the personal pronoun closest to a person name is taken as the pronoun matching that name. When searching for the pronoun closest to a person name, the specific sentence used is the sentence in which the name occurs. In this context, the context of any sentence (a specific sentence or a target sentence) may include sentences within a first preset number of sentences or a first preset number of words above it, and/or sentences within a second preset number of sentences or a second preset number of words below it. The first preset sentence number, first preset word number, second preset sentence number, and second preset word number may be set to any suitable values as needed; the first and second preset sentence numbers may be the same or different, as may the first and second preset word numbers. For example, it may be specified that only sentences within 5 sentences above and below a specific sentence are selected as its context, and sentences beyond 5 (e.g., the 6th sentence above or the 6th sentence below) do not join the search for a personal pronoun matching the current person name. In another example, the boundary of the specific sentence's context may be unrestricted, i.e., the pronoun closest to the current person name may be sought throughout the entire text to be processed.
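The proximity rule can be sketched on tokenized text. The pronoun list, function name, and token-distance measure are illustrative assumptions; the embodiment does not fix how distance is measured.

```python
# Sketch (assumed helper) of the proximity rule: within the specific
# sentence plus its context window, pick the personal pronoun whose
# token position is closest to an occurrence of the person name.
PRONOUNS = {"he", "she", "him", "her"}  # illustrative list

def nearest_pronoun(tokens, person_name):
    """tokens: the specific sentence and its context, already split
    into tokens. Returns the nearest pronoun, or None if the name or
    no pronoun occurs in the window."""
    name_positions = [i for i, t in enumerate(tokens) if t == person_name]
    pronoun_positions = [i for i, t in enumerate(tokens)
                         if t.lower() in PRONOUNS]
    if not name_positions or not pronoun_positions:
        return None
    best = min(pronoun_positions,
               key=lambda p: min(abs(p - n) for n in name_positions))
    return tokens[best]

tokens = "Xiaoli said she would go ; he stayed home".split()
match = nearest_pronoun(tokens, "Xiaoli")  # "she" is closer than "he"
```

Restricting `tokens` to the sentence plus a 5-sentence window on each side, or passing the whole text, corresponds to the bounded and unbounded context variants described above.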
As described above, for any character, although a matching personal pronoun is sought for each person name in the character's person-name set, a match may be found for every name, or no match may be found for at least some names. Thus, the number of finally identified personal-pronoun sets may be one or more, in one-to-one correspondence with one or more person names in the person-name set. Of course, the number of finally identified personal-pronoun sets may also be zero.
According to an embodiment of the present invention, the character attribute information includes character age information, and determining the global character information based at least on the at least one person-name set includes: for any specific character of the at least one character, determining the age of the specific character through the age prediction result obtained by each of a fourth operation and/or a fifth operation, so as to obtain the character age information of the specific character. The fourth operation includes: searching the person-name set corresponding to the specific character for age-indicating person names capable of indicating age; if at least one age-indicating person name is found, analyzing the age of the specific character based on the age indicated by the at least one age-indicating person name to obtain a first age prediction result. The fifth operation includes: for each person name in the person-name set corresponding to the specific character, inputting the text segment containing the person name, the specific sentence containing the person name, and the context of the specific sentence into an age prediction model to obtain a second age prediction result output by the age prediction model, wherein the second age prediction result indicates the age of the specific character.
For example, the specific character may be each of the at least one character, that is, for each of the at least one character, the age thereof may be determined using the scheme of the present embodiment.
Similarly to the character gender, the age of the specific character may be determined by the age prediction result obtained by each of the fourth operation and/or the fifth operation. If the age of the specific character is determined only through the fourth operation or the fifth operation, the age determined through the fourth operation or the fifth operation may be directly regarded as the age of the specific character to obtain final character age information. If the age of the specific character is determined through the fourth and fifth operations, the age prediction results of the two operations may be further processed, for example, weighted average, to obtain a new age prediction result, and the age of the specific character is determined based on the new age prediction result to obtain final character age information.
In the fourth operation, for any character, the representative character name and alias set may be checked for an age-indicating person name, that is, a person name capable of indicating age. Age-indicating person names include, for example, titles and kinship terms such as Miss/Madam/girl/boy/grandma/grandpa/dad/mom/prince/princess, etc. If such a person name exists, the character's age may be determined from it.
For example, character age may be divided into at least two age categories. As described above, a character's age may be a single value or an age range; that is, each age category may be represented by a single value or by an age interval. Each age category may also be represented by a descriptive word (e.g., "young person"). The first age prediction result may be, or may include, age ratio information in one-to-one correspondence with each age category (the age ratio P_i4 described below). Optionally, the first age prediction result may instead be, or include, age information identifying a unique age category. The at least two age categories may include several of child, teenager, young adult, elderly, and the like. Other ways of dividing character age are also possible and are not detailed here. For example, if the person-name set of a certain character includes 10 age-indicating person names, of which 3 indicate teenager, 4 indicate young adult, and 3 indicate elderly, the first age prediction result may be intermediate information such as "teenager ratio 3/10, young adult ratio 4/10, elderly ratio 3/10", or conclusive information such as "young adult".
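The fourth-operation tally is structurally the same as the gender tally. The marker-to-category mapping, example names, and function name below are illustrative assumptions.

```python
# Sketch (assumed helper) of the fourth-operation tally: age ratios
# P_i4 over the age-indicating person names in a character's name set.
# The marker-to-category mapping is an illustrative assumption.
AGE_MARKERS = {
    "boy": "child", "girl": "child",
    "student": "teenager",
    "mr": "young adult", "miss": "young adult",
    "grandpa": "elderly", "grandma": "elderly",
}

def age_ratios(name_set):
    counts = {}
    for name in name_set:
        lowered = name.lower()
        for marker, category in AGE_MARKERS.items():
            if marker in lowered:
                counts[category] = counts.get(category, 0) + 1
                break  # count each person name at most once
    total = sum(counts.values())
    if total == 0:
        return None
    return {c: n / total for c, n in counts.items()}

# The text's example: 10 age-indicating names, 3 teenager, 4 young
# adult, 3 elderly, gives ratios 3/10, 4/10, 3/10.
names = ["Student Wang", "Student Zhao", "Student Qian",
         "Mr. Sun", "Mr. Li", "Miss Zhou", "Mr. Wu",
         "Grandpa Zheng", "Grandma Feng", "Grandpa Chen"]
ratios = age_ratios(names)
```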
In the fifth operation, the character's age may be determined by an age prediction model. In one example, for each person name, the text segment containing the name, the specific sentence containing the name, and the context of that sentence may be input into the age prediction model. It will be appreciated that a person name may appear one or more times; in the case of multiple occurrences, there are multiple copies of the text segment containing the name, the specific sentence containing the name, and the context of that sentence. Optionally, the first age prediction result may additionally be input into the age prediction model. The age prediction model may be any suitable neural network model. For example, it may be implemented using one or more of a convolutional neural network (CNN), a recurrent neural network (RNN), a Transformer model, and the like.
Illustratively, the second age prediction result includes age probabilities in one-to-one correspondence with at least two age categories, and the age probabilities may be used to represent probabilities that the ages of the specific characters are the corresponding age categories. The age prediction model may directly output the probability of each age category. Of course, the age prediction model may also output the age information related to the unique age category.
According to an embodiment of the present invention, character age is divided into at least two age categories. Analyzing the age of the specific character based on the age indicated by the at least one age-indicating person name to obtain the first age prediction result includes: determining the age category indicated by each of the at least one age-indicating person name; and, for each of the at least two age categories, calculating the ratio of the number of age-indicating person names belonging to that age category to the total number of the at least one age-indicating person name, to obtain the age ratio P_i4 corresponding to that age category, the first age prediction result including the age ratios P_i4 in one-to-one correspondence with the at least two age categories, where i = 1, 2, 3, ..., M, and M is the total number of age categories in the at least two age categories. And/or, the second age prediction result includes age probabilities P_i5 in one-to-one correspondence with the at least two age categories, the age probability P_i5 representing the probability that the specific character's age is the corresponding age category.
The calculation of the age ratio P_i4 and the age probability P_i5 has been described above with examples and is not repeated here.
The step of calculating, for each of the at least two age categories, the ratio of the number of age-indicating person names belonging to that category to the total number of the at least one age-indicating person name, to obtain the age ratio P_i4 corresponding to that category, may be performed when the ages indicated by the at least one age-indicating person name are not uniform. If the ages indicated by the at least one age-indicating person name are uniform, for example all indicate 25 to 35 years old, that age may directly be taken as the age of the specific character to obtain the first age prediction result.
Compared with age information identifying a unique age category, the age ratio P_i4 and the age probability P_i5 contain richer information and facilitate subsequent processing, for example combining multiple age prediction results to obtain more accurate age information.
According to an embodiment of the present invention, determining the age of the specific character through the age prediction result obtained by each of the fourth operation and/or the fifth operation to obtain the character age information of the specific character includes: determining the age category with the largest age ratio P_i4 in the first age prediction result as the age of the specific character, so as to obtain the character age information; or determining the age category with the largest age probability P_i5 in the second age prediction result as the age of the specific character; or, for each of the at least two age categories, performing a weighted average or direct addition of the age ratio P_i4 and the age probability P_i5 corresponding to that category to obtain a total age ratio, and determining the age category with the largest total age ratio as the age of the specific character, so as to obtain the character age information.
When the age of a character is determined based on a single operation, the age category with the largest ratio or probability may be selected directly as the age of the specific character. When the age is determined based on multiple operations, the ratios or probabilities calculated by those operations for the same age category may be combined by weighted averaging or direct addition, and the age of the character determined from the combined totals.
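The combination schemes above can be sketched as follows. This is a minimal illustration, assuming hypothetical age category names and equal weights (the text does not fix either):

```python
def combine_age_predictions(age_ratios, age_probs, w_ratio=0.5, w_prob=0.5):
    """Weighted-average the age ratio (P_i4) and age probability (P_i5)
    per category, then pick the category with the largest total."""
    categories = set(age_ratios) | set(age_probs)
    totals = {c: w_ratio * age_ratios.get(c, 0.0) + w_prob * age_probs.get(c, 0.0)
              for c in categories}
    # the age category with the largest total age ratio becomes the role age
    return max(totals, key=totals.get), totals

# hypothetical categories and scores
ratios = {"child": 0.1, "youth": 0.6, "middle-aged": 0.2, "elderly": 0.1}
probs = {"child": 0.05, "youth": 0.7, "middle-aged": 0.15, "elderly": 0.1}
best_age, totals = combine_age_predictions(ratios, probs)
```

Setting `w_ratio = w_prob = 1.0` reproduces the "direct addition" variant, since scaling both weights equally does not change which category is largest.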
According to an embodiment of the present invention, determining the age of the specific character from the age prediction result obtained by the fourth operation and/or the fifth operation to obtain the character age information of the specific character includes: first performing the fourth operation and determining the character age information based on the first age prediction result if a third condition is satisfied, wherein the third condition includes: at least one age-indicative person name is found; and, if the third condition is not satisfied, performing the fifth operation and determining the character age information based on the second age prediction result.
Illustratively, the step of determining the character age information based on the first age prediction result may include: determining the age category with the largest age ratio P_i4 as the age of the target character to obtain the character age information. The step of determining the character age information based on the second age prediction result may include: determining the age category with the largest age probability P_i5 as the age of the target character to obtain the character age information.
The age of the character may be determined by means of the fourth and/or fifth operations described above. In an embodiment that uses both, the fourth and fifth operations may be performed in sequence. For example, for any character, the fourth operation may first attempt to determine the age; if no age-indicative person name exists, the fifth operation continues the determination. The fourth operation only needs to examine the character's person name set, involves little computation, and can identify the age quickly, so it may be performed first. The fifth operation involves an age prediction model, which usually requires training, has many parameters, and is computationally expensive; it may therefore be applied as a supplement when the fourth operation cannot determine the age of the character.
According to an embodiment of the present invention, the third condition further includes: in the first age prediction result, the largest age ratio P_i4 is greater than a third ratio threshold.
The third ratio threshold may be any suitable value and may be set as needed; the invention is not limited in this respect. The third ratio threshold may, for example, be the same as or different from either of the first and second ratio thresholds described above.
When deciding whether to continue with the fifth operation, it may further be considered whether the accuracy of the age prediction result obtained by the fourth operation meets the requirement. For example, if in the first age prediction result the largest age ratio P_i4 is not greater than the third ratio threshold, this may mean that the gap between the ratio of the corresponding age category and those of the other age categories is too small. The reliability of this age prediction result is then relatively low, and the fifth operation may be chosen to continue.
Under this scheme, whether a subsequent operation is performed can be decided from the reliability of the previous operation's age prediction result: if the reliability is high, the current result can be used to determine the character's age, reducing the workload; if the reliability is low, the subsequent operation can continue to further determine the age and improve prediction accuracy. The scheme thus balances workload against age prediction accuracy as far as possible.
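The staged fourth-then-fifth flow, including the third condition's reliability check, might be sketched as below. The name-to-age lookup table and the model callable are hypothetical stand-ins; the text does not specify their form:

```python
def predict_role_age(role_names, age_name_lookup, age_model, third_threshold=0.5):
    """Fourth operation first: count age categories indicated by the role's
    person names; fall back to the model-based fifth operation when no
    age-indicative name is found or the largest ratio is not reliable."""
    indicated = [age_name_lookup[n] for n in role_names if n in age_name_lookup]
    if indicated:
        ratios = {c: indicated.count(c) / len(indicated) for c in set(indicated)}
        top = max(ratios, key=ratios.get)
        if ratios[top] > third_threshold:  # third condition satisfied
            return top                     # from the first age prediction result
    probs = age_model(role_names)          # fifth operation (model-based)
    return max(probs, key=probs.get)       # from the second age prediction result

# hypothetical lookup and stub model
lookup = {"Xiaoming": "child", "Grandpa Wang": "elderly"}
model = lambda names: {"youth": 0.8, "elderly": 0.2}
```

Because the name lookup is consulted first, the expensive model is only invoked for roles whose person names carry no reliable age indication.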
According to an embodiment of the present invention, determining the global role information based at least on the at least one person name set includes: for each of the at least one character, analyzing the person name characteristics of each person name in the person name set corresponding to that character, the person name characteristics including one or more of: the number of occurrences, whether a surname is contained, and whether common name characters are contained; and selecting, from the person name set corresponding to the character, the person name whose characteristics meet preset requirements as the representative role name of the character.
According to an embodiment of the present invention, the preset requirements include: having the largest number of occurrences in the person name set corresponding to the character, or having the largest number of occurrences among the person names that contain a surname and/or common name characters.
Embodiments of determining the representative role name from person name characteristics have been described above and are not repeated here.
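A minimal sketch of this selection, assuming a small hypothetical surname list and using per-name occurrence counts:

```python
COMMON_SURNAMES = {"Wang", "Li", "Zhang"}  # hypothetical surname list

def representative_name(name_counts):
    """Prefer the most frequent person name that contains a known surname;
    otherwise fall back to the most frequent name overall."""
    with_surname = {n: c for n, c in name_counts.items()
                    if any(s in n for s in COMMON_SURNAMES)}
    pool = with_surname or name_counts
    return max(pool, key=pool.get)

names = {"Xiaoming": 10, "Wang Xiaoming": 6, "Mingming": 3}
```

With these counts, "Wang Xiaoming" is chosen even though "Xiaoming" occurs more often, because it is the most frequent name that contains a surname.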
According to an embodiment of the present invention, before performing text analysis on any target sentence in the text to be processed in combination with the global role information, the method further includes: outputting the global role information; receiving modification information related to the global role information input by a user; and modifying the global role information based on the modification information.
After the global role information is identified, the user can correct any errors in it, improving its accuracy. Any suitable output device may be employed to output the global role information. In one example, the speech synthesis system may output the global role information directly through an output device, which may be implemented using a display screen and/or speakers. In another example, the speech synthesis system may output the global role information to a client through a first output device, and the client outputs it for viewing by the user through its second output device; the first output device may be implemented by a wired and/or wireless communication device, and the second output device may be implemented by a display screen and/or a speaker.
According to an embodiment of the present invention, performing text analysis on any target sentence in the text to be processed in combination with the global role information includes: analyzing the text type of the target sentence to obtain an analysis result corresponding to the text type, the text analysis result including the analysis result corresponding to the text type; when the target sentence belongs to the multi-role type, identifying an initial role name corresponding to the target sentence based at least on the target sentence; retrieving from the global role information the target role information containing the initial role name, the target role information corresponding to the target role; and, for any specific preset item among the at least one preset item, extracting the information corresponding to that preset item from the target role information and determining the extracted information as the analysis result corresponding to that preset item in the text analysis result, wherein when the specific preset item is the role name, the extracted information is the representative role name in the target role information.
For preset items other than the specific preset item, a conventional analysis method may be adopted to analyze the preset items to obtain a corresponding analysis result.
It is understood that the step of retrieving the target role information containing the initial role name from the global role information, and the step of extracting, for any specific preset item, the corresponding information from the target role information and determining it as the analysis result for that preset item, are performed when the target sentence belongs to the multi-role type.
The step of identifying the initial role name corresponding to the target sentence based at least on the target sentence may be implemented using a conventional character identification method; the identification here may be role name identification based on local information. For example, at least one candidate character name may be identified from the target sentence and its context, and the character name corresponding to the target sentence found among them. Illustratively, identifying the initial role name may include: inputting one or more of the target sentence, the context of the target sentence, a text snippet containing the at least one candidate character name, and the global character information into the character recognition model to determine the initial role name.
For example, if analysis of the target sentence and its context shows that the utterer of the current target sentence is the role "king", then "king" may be taken as the initial role name corresponding to the current target sentence.
The specific preset item may include the role name and/or a character attribute. For example, if the initial role name corresponding to the target sentence is found to belong to the alias set of a certain role in the global role information, the initial role name may be overwritten with that role's representative role name, i.e., the representative role name is taken as the final role name corresponding to the target sentence. In this case, the specific preset item is the role name. For example, if the initial role name is "king", and the global role information records "king" as an alias of "Wang Xiaoming", then "king" may be overwritten with "Wang Xiaoming" as the role name corresponding to the target sentence.
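This alias-to-representative-name overlay can be sketched as follows, assuming the global role information is kept as a mapping from representative role name to an alias set (the actual storage format is not prescribed by the text):

```python
def resolve_role_name(initial_name, global_roles):
    """Overwrite a locally recognized alias with the representative role
    name recorded in the global role information."""
    for rep_name, info in global_roles.items():
        if initial_name == rep_name or initial_name in info.get("aliases", set()):
            return rep_name
    return initial_name  # role not found globally: keep the local result

global_roles = {"Wang Xiaoming": {"aliases": {"king", "Xiao Wang"}}}
```

Keeping the local result when no global match exists means the overlay can only add information, never discard a locally recognized role.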
For example, if the analysis result corresponding to a certain preset item in the text analysis result is found to be unsatisfactory, the set of role information for the target role corresponding to the target sentence can be looked up in the global role information, and the original analysis result replaced with the information corresponding to that preset item in the set of role information. The preset item whose analysis result is replaced is the specific preset item.
In this way, missing or inconsistent locally recognized character information can be corrected automatically using the global character information, improving the recognition accuracy of character information.
According to an embodiment of the present invention, before extracting, for any specific preset item of the at least one preset item, the information corresponding to that preset item from the target role information and determining the extracted information as the corresponding analysis result, the method further includes: when the target sentence belongs to the multi-role type, analyzing the character attribute corresponding to the target sentence to obtain the analysis result corresponding to the character attribute and the confidence of that analysis result, the text analysis result including the analysis result corresponding to the character attribute; and, if the confidence of the analysis result corresponding to the character attribute is lower than a preset confidence threshold, determining the character attribute as a specific preset item. In this case, extracting the information corresponding to the specific preset item from the target role information and determining it as the corresponding analysis result includes: extracting the character attribute information from the target role information; and overwriting the analysis result corresponding to the character attribute in the text analysis result with the extracted character attribute information.
When a character attribute is analyzed, the confidence of the analysis result can be obtained together with the result itself. For example, a gender classification model may be used to predict the gender of character A corresponding to the target sentence; an output indicating that the character's gender is male with a confidence (similar to the gender probability P_i3 described above) of 60% means the gender is male with probability 60% and female with probability 40%. Assuming the preset confidence threshold is 75%, since the confidence of character A's gender analysis result is below the threshold, the original analysis result of "male" can be overwritten with the character gender information related to character A in the global character information. Thus, if that global information is "female", the gender analysis result for the current target sentence can be corrected to the more accurate result. This helps improve the accuracy of subsequent speech synthesis and thus the user experience.
According to an embodiment of the present invention, the character attribute is divided into at least two types of attributes, and analyzing the character attribute corresponding to the target sentence to obtain the analysis result and its confidence includes: acquiring a text to be analyzed, which includes the target sentence and its context; acquiring the person name set corresponding to the target role from the global role information; identifying all text fragments of the target role from the text to be analyzed based on that person name set; performing attribute identification on each of these text fragments to determine the attribute corresponding to each one; for each of the at least two types of attributes, calculating the ratio of the number of text fragments corresponding to that attribute to the total number of text fragments to obtain an attribute ratio; selecting the attribute with the largest attribute ratio as the character attribute corresponding to the target sentence, to obtain the analysis result corresponding to the character attribute; and determining the largest attribute ratio as the confidence of that analysis result.
The length of the text to be analyzed can be set as required. Preferably, the text to be analyzed is a local text, i.e. the length of which is smaller than the length of the entire text to be processed.
For example, suppose the character attribute is character gender, divided into two categories, male and female. Assume the person name set of character A includes 3 person names, appearing 5, 10, and 3 times respectively in the text to be analyzed, so character A appears 18 times in total. Gender identification is performed on the text fragments corresponding to these 18 occurrences, yielding 18 gender identification results. The proportion of each gender among the 18 results is counted; for example, if males account for 12/18 and females for 6/18, the gender corresponding to the target sentence is determined to be male, with 12/18 as the confidence. If 12/18 is below the preset confidence threshold, the current gender analysis result of the target sentence is considered unreliable and can be overwritten with the character gender information corresponding to character A in the global character information.
When the character attribute is character age, the scheme for calculating the confidence is similar to the gender example and is not repeated.
This is a statistical scheme: the local frequency of a certain attribute of the same role serves as the confidence used to decide whether the analysis result of that attribute needs to be corrected.
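The statistical scheme, including the threshold-based overlay, might look like this; the 12/18 gender example from the text is reused, with the 75% confidence threshold treated as a hypothetical default:

```python
from collections import Counter

def attribute_with_confidence(fragment_labels):
    """Majority attribute over a role's text fragments; the majority share
    serves as the confidence of the analysis result."""
    counts = Counter(fragment_labels)
    attr, n = counts.most_common(1)[0]
    return attr, n / len(fragment_labels)

def maybe_override(local_attr, confidence, global_attr, threshold=0.75):
    """Overlay the local result with global role information when the
    confidence falls below the preset confidence threshold."""
    return local_attr if confidence >= threshold else global_attr

labels = ["male"] * 12 + ["female"] * 6           # 18 fragment-level results
attr, conf = attribute_with_confidence(labels)    # majority vote with share
final = maybe_override(attr, conf, "female")      # 12/18 < 0.75: overridden
```

Here the local majority "male" is discarded because its share (about 67%) is below the threshold, so the global information "female" wins.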
According to an embodiment of the present invention, analyzing a character attribute corresponding to a target sentence to obtain an analysis result corresponding to the character attribute and a confidence level of the analysis result corresponding to the character attribute includes: acquiring a text to be analyzed, wherein the text to be analyzed comprises a target sentence and a context of the target sentence; inputting the text to be analyzed into the attribute identification model to obtain the analysis result corresponding to the character attribute and the confidence coefficient of the analysis result corresponding to the character attribute output by the attribute identification model.
Optionally, one or more of the global character information, the analysis result corresponding to the character name among the text analysis results, and the at least one candidate character name may also be input into the attribute identification model.
Since this scheme uses a model for attribute identification, the probability of the most probable attribute output by the model can be used directly as the confidence of the analysis result.
According to the embodiment of the invention, identifying the initial role name corresponding to the target sentence at least based on the target sentence comprises: acquiring a text to be analyzed, wherein the text to be analyzed comprises a target sentence and a context of the target sentence; extracting initial candidate role names based on the text to be analyzed; determining the initial candidate role name as the final candidate role name; and inputting the global character information, the final candidate character names and the text to be analyzed into a character recognition model to obtain a character recognition result output by the character recognition model, wherein the character recognition result is used for indicating the character name corresponding to the target sentence as an initial character name.
According to the embodiment of the invention, when the target sentence belongs to the multi-role type, after identifying the initial role name corresponding to the target sentence based on at least the target sentence, performing text analysis on any target sentence in the text to be processed by combining the global role information further comprises: acquiring a text to be analyzed, wherein the text to be analyzed comprises a target sentence and a context of the target sentence; and inputting the global character information, the initial character name and the text to be analyzed into an attribute identification model to obtain an analysis result which is output by the attribute identification model and corresponds to the character attribute.
As described above, global character information may be used as one of inputs to the character recognition model or attribute recognition model when analyzing local character names or character attributes, so that these models can predict the character names or character attributes by referring to the global character information. This approach may improve the predictive performance of the character recognition model or the attribute recognition model.
According to an embodiment of the present invention, clustering the person names belonging to the same character among all person names to obtain at least one person name set corresponding one-to-one to at least one character includes: clustering the person names belonging to the same role among all person names to obtain person name sets corresponding one-to-one to all roles in the text to be processed; and selecting, from all roles, those whose corresponding person name set contains a number of person names greater than a preset number threshold as the at least one role.
If a character's name appears too few times in the entire text to be processed, it is difficult to collect rich character information, which may result in low accuracy of the global character information and thus low reference value. In that case, such roles may be excluded when acquiring global role information; that is, global character information is collected only for characters whose number of occurrences exceeds the threshold.
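Filtering roles by occurrence count after clustering can be sketched as below; the threshold value and the nested-dict representation of clusters are hypothetical choices:

```python
def filter_roles(role_name_counts, min_occurrences=3):
    """Keep only roles whose person names occur often enough in the whole
    text to support reliable global role information."""
    return {role: counts for role, counts in role_name_counts.items()
            if sum(counts.values()) > min_occurrences}

clusters = {
    "Wang Xiaoming": {"Wang Xiaoming": 6, "king": 5},  # 11 occurrences total
    "Passerby": {"Passerby": 1},                       # 1 occurrence total
}
kept = filter_roles(clusters)
```

Only "Wang Xiaoming" survives the filter; "Passerby" is excluded from global role information gathering.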
Fig. 2 shows a flow diagram of a text analysis method 100 according to one embodiment of the invention. As shown in Fig. 2, after the text to be processed is acquired, a person name NER model may be used to detect the person names in the text. A representative role name and an alias set for each role may then be determined by clustering. Subsequently, the age and gender of each character can be identified, finally yielding global character information that includes the representative role name, alias set, gender, age, and other information of each character.
To at least partially solve the technical problem that existing multi-role, multi-emotion speech synthesis technology, relying purely on models, cannot guarantee the final synthesis effect, the embodiments of the present invention provide a semi-automatic multi-role, multi-emotion speech synthesis technology. It combines the efficiency of models with the accuracy of manual intervention, obtaining high-quality synthesized speech while maintaining high efficiency, so as to better meet the needs of application scenarios, such as audiobooks, that require multi-role, multi-emotion narration.
According to one aspect of the present invention, a speech synthesis method is disclosed, which is primarily a multi-persona, multi-emotion speech synthesis method. Fig. 1 shows a schematic flow chart of a speech synthesis method 300 according to one embodiment of the invention. As shown in fig. 1, the speech synthesis method 300 includes steps S310-S360.
In step S310, a text to be processed is acquired.
The text to be processed may be text of any length including, but not limited to: an entire article (e.g., a whole novel), a section in an article, a paragraph in an article, or a sentence in an article, etc.
In step S320, text analysis is performed on any target sentence in the text to be processed to obtain an initial text analysis result, where the text analysis includes analysis of at least one preset item, the at least one preset item including one or more of: text type, role name, character attribute, and emotion category. The analysis of the text type refers to judging whether the target sentence belongs to a multi-role type, the multi-role type including dialogue, and the character attributes include character gender and/or character age.
The target sentence may be any one or more sentences in the text to be processed. Any existing or future sentence extraction technique may be used to extract the target sentence from the text to be processed. For example, dialogue text typically appears inside double quotation marks, so dialogue can be identified by detecting double quotation marks. After the dialogue text is identified, it may be divided into sentences, any of which may serve as a target sentence. The granularity of sentence division can be set as needed. For example, the whole dialogue within a pair of double quotation marks may be taken together as one target sentence. Alternatively, the dialogue within the quotation marks may be split into short sentences, each treated as a target sentence. By way of example and not limitation, the dialogue may be split at punctuation, e.g., once every period is encountered, or once every comma and period are encountered.
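Detecting dialogue via double quotation marks and splitting it into target sentences might be sketched as below, assuming Chinese-style quotation marks and sentence-ending punctuation:

```python
import re

def extract_dialogue_sentences(text):
    """Find dialogue inside double quotation marks, then split each dialogue
    into short target sentences at sentence-ending punctuation."""
    dialogues = re.findall(r'“([^”]*)”', text)
    sentences = []
    for d in dialogues:
        # lookbehind split keeps the punctuation attached to each sentence
        sentences.extend(s for s in re.split(r'(?<=[。！？])', d) if s)
    return sentences

sample = '他说：“你好。再见！”然后离开了。'
```

Narration outside the quotation marks ("他说：", "然后离开了。") is not returned, matching the idea that only quoted text is treated as dialogue here.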
Multi-role, multi-emotion text analysis is performed on the target sentence, and may first include text type analysis. After the target sentence is obtained from the text to be processed, its text type can be judged; alternatively, the text type may be determined while acquiring the target sentence. The text type refers to whether the target sentence is a monologue, a dialogue, or narration (an aside). Illustratively, a target sentence within particular punctuation marks (e.g., double quotation marks) may be regarded as dialogue. Alternatively, the target sentence, or the target sentence together with its context, may be input into a pre-trained text type analysis model; that is, text type determination is treated as a classification task, and the model classifies the sentence to determine its text type. For sentences whose text type is monologue or dialogue, an acoustic model of the speaker (i.e., the speech synthesis model described herein) matching the monologue or dialogue can be selected according to the name of the character who utters it, and the speaker's speech synthesized accordingly. For sentences whose text type is narration, the acoustic model of a default speaker can be selected to synthesize the default speaker's speech. Alternatively, all narration in the entire text to be processed may be synthesized with the same default speaker's voice.
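The text-type-to-speaker dispatch described above can be sketched as follows; the type labels and the speaker table are hypothetical encodings:

```python
def select_speaker(text_type, role_name, role_speakers, default_speaker="narrator"):
    """Monologue/dialogue use the acoustic model matched to the uttering
    role; narration falls back to the default speaker."""
    if text_type in ("monologue", "dialogue"):
        return role_speakers.get(role_name, default_speaker)
    return default_speaker

speakers = {"Wang Xiaoming": "male_youth_voice"}
```

Falling back to the default speaker when a role has no matched acoustic model keeps synthesis working even for roles missing from the table.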
When the target sentence is a dialogue or a monologue, it may be further subjected to multi-role analysis and multi-emotion analysis. Multi-role analysis includes role recognition, i.e., automatically recognizing the utterer of a dialogue or monologue in the text; the recognized role can be represented by a role name. Besides the role itself, attributes such as its gender and age may also be acquired; thus multi-role analysis may also include character attribute identification, to identify attributes such as the gender and/or age of the character. Multi-emotion analysis, also called emotion recognition, automatically recognizes the category of emotion a character holds when uttering a dialogue or monologue in a specific context. Emotion categories may be classified and preset as needed; for example, all emotions may be classified into four categories: joy, anger, sorrow, and happiness. Of course, emotions may be divided more finely as needed, and the invention is not limited in this respect.
In general, when the target sentence is a dialogue or monologue, the preset items targeted by the text analysis may include one or more of text type, role name, character attribute, and emotion category, and the analysis result corresponding to each analyzed preset item is included in the resulting initial text analysis result. It will be appreciated that the analysis result for any preset item may or may not contain information, i.e., it may be an empty result. For example, for a certain piece of dialogue text, the analyzed preset items include the emotion category; the emotion recognition model may recognize the emotion as "happy", but it may also fail, in which case the analysis result is empty. Where empty results exist, the user may choose to supplement the related information manually, as described later.
For example, for items not involved in the text analysis operation, default settings or manual settings by the user may be employed.
When the target sentence is narration, the role name, character attributes, and emotion category may be left at their defaults, i.e., this information may be blank, and speech synthesis may be performed directly with a default narration voice (whose timbre, tone, mood, etc. are fixed). Thus, when the target sentence is narration, the preset item targeted by the text analysis may include only the text type.
Because multi-role, multi-emotion speech synthesis places higher demands on the quality of the final synthesized audio, manual intervention can be added after text analysis to further improve the analysis quality and obtain better synthesized speech. The implementation of manual intervention is described below.
In step S330, text result information including the initial text analysis result is output for viewing by a user.
Any suitable output device may be used to output the text result information. In one example, the speech synthesis system may output text result information directly through an output device, which may be implemented using a display screen and/or speakers. In another example, the speech synthesis system may output the text result information to the client through the first output device, and the client outputs the text result information for viewing by the user through the second output device thereof, where the first output device may be implemented by a wired and/or wireless communication device, and the second output device may be implemented by a display screen and/or a speaker.
The speech synthesis system described herein may comprise a client and/or a server, i.e. it may be either the client or the server itself, or it may be formed by a combination of the client and the server. The client may be a personal computer, mobile terminal or the like.
The text result information may be represented in one or more forms, such as audio information, text information, video information, and image information. For example, the initial text analysis result may be displayed directly on a display screen in the form of text information for viewing by a user. For another example, the initial text analysis result may be played through a speaker in the form of audio information for the user to listen to.
The text result information is output so that the user can learn the text analysis result currently produced by the speech synthesis system, check whether any information in it is erroneous or missing, and choose to correct the errors or supplement the omissions.
In step S340, text feedback information input by a user is received.
For example, the text feedback information may include modification information for modifying the analysis result corresponding to any one or more preset items in the text analysis result, and the speech synthesis system may modify the analysis result corresponding to any preset item based on the modification information related to that preset item. That is, when the user finds an error or omission in the output initial text analysis result, the corresponding modification information may be input to modify the initial text analysis result. Modifications described herein may include one or more of correcting erroneous information, adding missing information, and deleting redundant information. For example, if the user finds that the character name item was recognized incorrectly, the character name can be corrected, for example, "Zhang San" changed to "Li Si". For another example, if the user finds that an age in the character attributes was not successfully detected, i.e. the initial text analysis result lacks that information or the data at that item is empty, the information may be supplemented manually. For another example, if the text type is actually narration but was recognized as dialogue, with a character name and related information recognized accordingly, the user may delete the character name and related information while correcting the text type to narration.
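As an illustration only (the patent does not specify a data format), applying modification information to an initial text analysis result can be sketched as a dictionary update, where each preset item is a key; the `modify_result` helper and the field names are assumptions for this sketch:

```python
# Illustrative sketch (not from the patent): applying first modification
# information to an initial text analysis result. A value of None deletes
# the item (redundant information); other values correct or supplement it.

def modify_result(initial_result: dict, modifications: dict) -> dict:
    """Return a new analysis result with the user's modifications applied."""
    result = dict(initial_result)
    for item, value in modifications.items():
        if value is None:
            result.pop(item, None)   # delete redundant information
        else:
            result[item] = value     # correct erroneous or add missing info
    return result

initial = {"type": "dialogue", "character": "Zhang San", "age": None}
feedback = {"character": "Li Si", "age": "child"}  # correct name, add age
final = modify_result(initial, feedback)
```

Deleting a redundant item (e.g., a character name on a narration sentence) would be expressed by mapping that key to `None` in the feedback.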
The text feedback information may also include, for example, correct feedback information (which may be referred to as first correct feedback information) related to the initial text analysis result, where the correct feedback information may be used to indicate that the initial text analysis result is free of problems, and does not require modification, so as to facilitate notifying the speech synthesis system to continue with subsequent operations (e.g., subsequent speaker adaptation, speech synthesis, etc.).
Any suitable receiving means may be employed to receive the text feedback information. In one example, the user may input text feedback information directly to the speech synthesis system, at which point the receiving device may be implemented using one or more of a keyboard, mouse, microphone, and touch screen. In another example, the user may also input text feedback information to the client through the input device of the client, and the client transmits the text feedback information to the speech synthesis system, where the receiving device may be implemented using a wired and/or wireless communication device.
In step S350, in the case where the text feedback information includes first modification information related to the initial text analysis result, the initial text analysis result is modified based on the first modification information to obtain a new text analysis result.
The first modification information is used for indicating that the initial text analysis result has a problem and needs to be modified. If the user inputs modification information for the initial text analysis result, the corresponding information part may be modified according to the modification information. Steps S340 and S350 may be referred to as manual modification (or manual configuration or manual intervention) steps.
In step S360, the target sentence is speech synthesized based at least on the final text analysis result to obtain a final synthesized speech corresponding to the target sentence, wherein the final text analysis result is an initial text analysis result in the case where the initial text analysis result is not modified, and the final text analysis result is a new text analysis result in the case where the initial text analysis result is modified.
For example, if the initial text analysis result is problem-free and requires no modification, the initial text analysis result may be directly determined as the final text analysis result. Otherwise, if the initial text analysis result has a problem and is modified according to the modification information provided by the user, the new text analysis result may be determined as the final text analysis result. The final text analysis result is the text analysis result that participates in the final speech synthesis.
For example, the final text analysis result may be applied to a speaker adaptation operation, speaker information matching a character (or a character and emotion) corresponding to the target sentence is found through speaker adaptation, and an acoustic model corresponding to the speaker information is found. Finally, speech synthesis may be performed based on the acoustic model to obtain the desired synthesized speech.
The existing multi-role, multi-emotion speech synthesis schemes that rely purely on models have the following disadvantage: a pure model pipeline cannot guarantee the final synthesis effect, since the actual performance (accuracy, recall, etc.) of any single model rarely reaches 100%, and the errors of multiple cascaded models compound one another, further degrading the result. With the multi-role, multi-emotion speech synthesis technology provided by the invention, adding manual intervention after text analysis allows the analysis result for the text to be corrected in time, avoiding further accumulation of errors. The scheme combines the efficiency of models with the accuracy of manual intervention, and can obtain high-quality synthesized speech while maintaining high efficiency, so as to better meet the requirements of application scenarios such as audiobooks that need multi-role, multi-emotion reading.
Illustratively, text analysis of any target sentence in the text to be processed (step S320) may include: performing text analysis on the target sentence by using a text analysis model; after modifying the initial text analysis result based on the first modification information (step S350), the method 300 may further include: taking the new text analysis result as annotation data, taking the initial text analysis result as prediction data, and calculating a loss function of the text analysis model; the text analysis model is optimized using the calculated loss function.
The text analysis model may include one or more of a text type analysis model, a character recognition model, an attribute recognition model, and a mood recognition model. Optionally, in the case that the at least one preset item includes a text type, the text analysis model may include a corresponding text type analysis model; in the case where the at least one preset item includes a character name, the text analysis model may include a corresponding character recognition model; in the case that the at least one preset item includes character attributes, the text analysis model may include a corresponding attribute recognition model; in case the at least one preset item comprises an emotion category, the text analysis model may comprise a corresponding emotion recognition model.
For example, the text type analysis model may take the target sentence and/or the context of the target sentence as input and the text type analysis result as output. Illustratively, the character recognition model may take the target sentence and/or the context of the target sentence as input and the character name analysis result as output. For example, the attribute identification model may take as input one or more of a target sentence, a context of the target sentence, and a role name analysis result, and as output the role attribute analysis result. For example, the emotion recognition model may take as input one or more of a target sentence, a context of the target sentence, a character name analysis result, and a character attribute analysis result, and take as output an emotion category analysis result.
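The chain of inputs and outputs described above can be sketched with placeholder functions, each standing in for one trained analysis model; the hard-coded return values are illustrative stand-ins, not real predictions:

```python
# Illustrative sketch of how the four analysis models chain together.
# Each function is a placeholder for a trained model.

def analyze_text_type(sentence, context):
    return "dialogue"                    # text type analysis result

def recognize_character(sentence, context):
    return "Zhang San"                   # character name analysis result

def recognize_attributes(sentence, context, character):
    # attribute recognition also consumes the character name result
    return {"gender": "male", "age": "adult"}

def recognize_emotion(sentence, context, character, attributes):
    # emotion recognition consumes the earlier results as well
    return "happy"                       # emotion category analysis result

sentence, context = '"Great!" he shouted.', "Zhang San read the letter."
text_type = analyze_text_type(sentence, context)
character = recognize_character(sentence, context)
attributes = recognize_attributes(sentence, context, character)
emotion = recognize_emotion(sentence, context, character, attributes)
```

The point of the sketch is the data flow: later models take the outputs of earlier models as additional inputs.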
Each of the text type analysis model, the character recognition model, the attribute recognition model, and the emotion recognition model may be implemented using any suitable existing or future model, and will not be described in detail herein.
Any of the models herein (e.g., the text type analysis model, character recognition model, attribute recognition model, emotion recognition model, or the speaker adaptation model and speech synthesis model described below) may be constructed as a combination of rules and a network model. The rules may be any suitable machine learning rules. By way of example and not limitation, the network model may be implemented using one or more of a convolutional neural network model (CNN), a recurrent neural network model (RNN), a Transformer model, and the like.
For the text analysis model, the input may include the target sentence, or the target sentence and its context, and the output may be the text analysis result corresponding to the target sentence. The initial text analysis result can serve as the automatic analysis result of the text analysis model, i.e., the prediction data, and the new text analysis result obtained through the user's modification can serve as the annotation data (ground truth). A loss function may be calculated based on the prediction data and the annotation data, and the text analysis model may be optimized based on the loss function. Those skilled in the art will understand the meaning and implementation of annotation data, prediction data, and loss functions, which are not described in detail herein.
Optimizing the model based on the loss function may involve, for example, back-propagation and gradient computation based on the loss function, updating the parameters of the model until the model converges. Those skilled in the art will understand the implementation of model optimization, which is not described in detail herein.
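Purely as a schematic illustration of this loss-driven update cycle: the user-corrected result acts as the label, a squared-error loss is computed against the model's prediction, and gradient descent updates the parameters. A real text analysis model would be a neural network trained by a framework's autograd; the one-parameter model below is only a stand-in:

```python
# Schematic stand-in for loss-based optimization: a one-parameter "model"
# w*x is pulled toward the user-corrected label by gradient descent.

def loss(pred: float, label: float) -> float:
    return (pred - label) ** 2          # squared-error loss

def optimize(w: float, x: float, label: float,
             lr: float = 0.1, steps: int = 200) -> float:
    for _ in range(steps):
        pred = w * x                    # model prediction
        grad = 2 * (pred - label) * x   # d(loss)/dw by the chain rule
        w -= lr * grad                  # gradient-descent parameter update
    return w

w = optimize(w=0.0, x=1.0, label=0.8)   # parameter converges toward label
```

Each user correction thus supplies one (prediction, label) pair, and repeated use of the system accumulates training signal.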
As a user of the speech synthesis system, the user can employ it to synthesize the multi-role, multi-emotion audio the user intends. Meanwhile, the speech synthesis system can collect data generated during each use, and the user's modifications of the text analysis results can in turn be used to optimize the text analysis model adopted by the system. This optimization process iterates: the longer a user works with the system, the better the model is optimized, the more reliable its output becomes, and the fewer modifications the user needs to make on subsequent uses.
In the field of multi-role, multi-emotion speech synthesis, training data for the related models is very difficult to acquire. Building large-scale, high-quality multi-emotion, multi-role training data from scratch by hand is very laborious, whereas combining models with manual intervention not only provides users with the required speech synthesis service but also synchronously achieves iterative optimization of the models. In addition, the speech synthesis system can obtain abundant training data from a large number of users, which greatly reduces both the difficulty of obtaining model training data and the burden of manual annotation, while greatly improving the training efficiency of the models.
Therefore, through speech synthesis that combines models with manual intervention, a high-quality text analysis result can be obtained, the training data of the models can be expanded at the same time, and the models can be optimized continuously and iteratively to improve their data processing effect.
Illustratively, the text analysis model includes at least one preset analysis model corresponding to at least one preset item one by one, and the first modification information includes modification information of an analysis result corresponding to a specific preset item among the initial text analysis results; taking the new text analysis result as labeling data and the initial text analysis result as prediction data, and calculating the loss function of the text analysis model comprises the following steps: taking an analysis result corresponding to a specific preset item in the new text analysis result as marking data, taking an analysis result corresponding to the specific preset item in the initial text analysis result as prediction data, and calculating a specific loss function of a specific preset analysis model corresponding to the specific preset item; optimizing the text analysis model using the calculated loss function includes: and optimizing the specific preset analysis model by using the specific loss function.
What kind of preset analysis model corresponds to each preset item has been described above, and will not be described here again. For each preset item, if the user modifies the analysis result of the preset item, the preset analysis model corresponding to the preset item can be trained.
Illustratively, speech synthesizing the target sentence based at least on the final text analysis result (step S360) includes: performing speaker adaptation based on the final text analysis result to determine initial speaker information matched with the target sentence; determining initial speaker information as final speaker information; calling a corresponding voice synthesis model from a model library based on the final speaker information, wherein the model library is used for storing voice synthesis models corresponding to a plurality of groups of different speaker information one by one; and performing voice synthesis on the target sentence by using the called voice synthesis model.
A speaker may also be referred to as a voice library (sound library). The acoustic models of many speakers (voice libraries) may be stored in advance. After finding the speaker (voice library) that matches the target sentence, speech synthesis may be performed based on that speaker's acoustic model. Note that in this context, each set of speaker information is unique and corresponds to a unique acoustic model.
In one example, the same speaker may have unique speaker information (e.g., represented by a speaker id) and a unique corresponding acoustic model. For example, if the speaker is a natural person, Xiaohong, only a single acoustic model is created for her; if the emotion category of the current target sentence is recognized as "happy", that emotion category can be used as a variable of the acoustic model and input into Xiaohong's acoustic model together with the target sentence, producing speech in Xiaohong's voice that carries the "happy" emotion and whose content matches the target sentence.
In another example, the same speaker may have more than one set of speaker information, each set corresponding to one acoustic model. For example, for the same natural person Xiaohong, two sets of speaker information may be formed: one set is "Xiaohong + happy", corresponding to Xiaohong's voice in a happy state, and the other is "Xiaohong + sad", corresponding to Xiaohong's voice in a sad state; these two sets of speaker information may be represented by different identifiers (e.g., SoundID). Likewise, two acoustic models may be created for the two sets of speaker information, respectively. For example, the corresponding speaker information, and thus the acoustic model, may be determined based on the character name, character attributes, and emotion category of the target sentence. In this case, the target sentence can be input directly into the acoustic model to synthesize speech matching the role and emotion.
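This second example can be sketched as two lookup tables: speaker adaptation maps the role and emotion to a SoundID-style identifier, and the model library maps that identifier to an acoustic model. All keys and names below are illustrative assumptions:

```python
# Sketch of speaker adaptation with per-(speaker, emotion) acoustic models.
# Table contents and identifiers are illustrative, not from the patent.

MODEL_LIBRARY = {
    "xiaohong_happy": "acoustic-model-A",  # Xiaohong's voice, happy state
    "xiaohong_sad": "acoustic-model-B",    # Xiaohong's voice, sad state
}

SPEAKER_TABLE = {
    # (character, gender, age, emotion) -> SoundID
    ("Xiaohong", "female", "child", "happy"): "xiaohong_happy",
    ("Xiaohong", "female", "child", "sad"): "xiaohong_sad",
}

def adapt_speaker(character, gender, age, emotion):
    """Speaker adaptation: map role and emotion to speaker information."""
    return SPEAKER_TABLE[(character, gender, age, emotion)]

def fetch_model(sound_id):
    """Call the matching speech synthesis model from the model library."""
    return MODEL_LIBRARY[sound_id]

sid = adapt_speaker("Xiaohong", "female", "child", "happy")
model = fetch_model(sid)
```

In a deployed system the table lookups would be replaced by a trained speaker adaptation model, but the two-stage structure (speaker information first, then acoustic model) is the same.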
The model library may store several of the acoustic models described above. The model library may be implemented using a single memory space or may be implemented using a distributed memory space (e.g., stored in a distributed manner on a plurality of server units).
For example, the speech synthesis system may automatically match appropriate speaker information for the target sentence based on at least some of the information of the target sentence, the context of the target sentence, the text type analysis result, the character name analysis result, the character attribute analysis result, the emotion type analysis result, and the like, so as to obtain a corresponding speech synthesis model. Subsequently, speech synthesis can be performed using the speech synthesis model to obtain the desired synthesized speech.
Illustratively, performing speaker adaptation based on the final text analysis results includes: performing speaker adaptation based on analysis results corresponding to the character names, the character attributes and the emotion categories in the final text analysis results so as to obtain initial speaker information; the speech synthesis of the target sentence using the speech synthesis model includes: and inputting the target sentence into the voice synthesis model to obtain the final synthesized voice output by the voice synthesis model.
Corresponding acoustic models may be created depending on the gender, age, and emotion of the speaker himself, such that the same speaker may have more than one set of speaker information and more than one acoustic model. For example, different sets of speaker information may be represented by different identifiers, and the speaker adaptation may be performed by finding an identifier that matches the role and emotion of the target sentence. This case can be directly based on the target sentence for speech synthesis.
Illustratively, performing speaker adaptation based on the final text analysis results includes: performing speaker adaptation based on analysis results corresponding to the character names and the character attributes in the final text analysis results so as to obtain initial speaker information; the speech synthesis of the target sentence using the speech synthesis model includes: inputting the analysis results corresponding to the emotion categories in the target sentence and the final text analysis results into the voice synthesis model to obtain final synthesized voice output by the voice synthesis model.
The corresponding acoustic model can be created according to the gender and age of the speaker himself, so that the same speaker has unique speaker information and a unique acoustic model. As described above, in this case the emotion category information of the target sentence may be input into the acoustic model together with the target sentence to obtain the final synthesized speech.
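In contrast to the per-emotion model library above, this first example passes the emotion category into a single acoustic model as an input. The `synthesize` function below is a placeholder, not a real API:

```python
# Sketch of the one-model-per-speaker case: the emotion category is an
# input variable of the acoustic model. The function is a placeholder
# that records its inputs instead of producing real audio.

def synthesize(sentence: str, emotion: str) -> str:
    return f"audio({sentence!r}, emotion={emotion!r})"

audio = synthesize("Hello!", emotion="happy")
```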
Illustratively, after determining the initial speaker information as the final speaker information and before invoking the corresponding speech synthesis model from the model library based on the final speaker information, speech synthesizing the target sentence based at least on the final text analysis result (step S360) may further include: outputting initial speaker information; receiving feedback information of a speaker input by a user; modifying the initial speaker information based on the second modification information to obtain new speaker information in the case that the speaker feedback information includes the second modification information related to the initial speaker information; and updating the final speaker information to new speaker information.
Similar to the initial text analysis results, the user can also view the initial speaker information and modify it as needed. The modification of the initial speaker information may be understood with reference to the modification of the initial text analysis result, and will not be described herein.
The speaker feedback information may also include, for example, correct feedback information (which may be referred to as second correct feedback information) associated with the initial speaker information, where the correct feedback information may be used to indicate that the initial speaker information is problem-free and requires no modification, so as to notify the speech synthesis system to continue with subsequent operations (e.g., subsequent speech synthesis based on a speech synthesis model, etc.).
For example, if the initial speaker information is problem-free, without modification thereof, the initial speaker information may be directly determined as the final speaker information. Otherwise, if the initial speaker information has a problem and is modified according to the modification information indicated by the user, the new speaker information may be determined as the final speaker information. The final speaker information is speaker information for participating in final speech synthesis.
For scenes with many characters and speakers (voice libraries) (e.g., hundreds of characters), matching errors are difficult to avoid when speaker adaptation relies purely on models, while relying entirely on manual configuration of the correspondence between text and speaker information is inefficient. By combining models with manual intervention, the speaker information can first be matched automatically and then modified manually by the user where errors exist. This scheme balances the efficiency and accuracy of speaker adaptation.
In addition, with this scheme the initial text analysis result and the initial speaker information can both be output so that the user can view them and modify them as a whole when needed, allowing the speech synthesis effect to be optimized overall. The prior art mainly focuses on single tasks such as emotion style and role prediction, with little holistic consideration of the multi-role, multi-emotion speech synthesis method. However, optimizing the effect of a single task can hardly guarantee the overall effect, whereas optimizing the pipeline as a whole effectively avoids this problem.
Illustratively, performing speaker adaptation based on the final text analysis results includes: performing speaker adaptation based on the final text analysis result by using the speaker adaptation model; after modifying the initial speaker information based on the second modification information, the method 300 may further include: taking the new speaker information as labeling data, taking the initial speaker information as prediction data, and calculating a loss function of the speaker adaptation model; optimizing the speaker adaptation model using the calculated loss function.
In the case of a user modifying the initial speaker information, the speaker adaptation model may be optimized by the new speaker information and the initial speaker information. The optimization of the speaker adaptation model is similar to the optimization of the text analysis model, and the optimization of the speaker adaptation model may be understood with reference to the above description of the optimization of the text analysis model, which is not described herein.
In the prior art, the training of a text analysis model and of a speaker adaptation model is performed independently, and their training data are usually derived from different sentences; such a training scheme can hardly achieve overall optimization of the speech synthesis system. In this embodiment, the relevant information of the same target sentence is used as training data for both the text analysis model and the speaker adaptation model, which ensures the uniformity of their data sources, so the scheme achieves better optimization for the speech synthesis system as a whole.
The target sentence, the text analysis result, the speaker information, and the synthesized speech at any current time are constructed into a multi-element tuple data structure in a unified form, expressed as <Content, Type, Character, Gender, Age, Emotion, SoundID, Audio>, where Content represents the text content of the target sentence, Type represents the text type of the target sentence, Character represents the name of the role corresponding to the target sentence, Gender represents the gender of that role, Age represents the age of that role, Emotion represents the emotion category corresponding to the target sentence, SoundID represents the speaker information corresponding to the target sentence, and Audio represents the synthesized speech corresponding to the target sentence.
Throughout the speech synthesis process, the unified multi-element tuple data structure can represent the target sentence, text analysis result, speaker information, and synthesized speech. The text analysis result at any current time may be the initial text analysis result or the new text analysis result, and of course may also be an intermediate state between the two. For example, the target sentence plus the initial text analysis result may be expressed as <Content, Type', Character', Gender', Age', Emotion'>; if the user first modifies the analysis result of a character name, the tuple is updated to <Content, Type', Character, Gender', Age', Emotion'>; if the user then modifies the analysis result of that character's gender, the tuple is updated to <Content, Type', Character, Gender, Age', Emotion'>; and so on, until after all modifications are completed the entire tuple becomes <Content, Type, Character, Gender, Age, Emotion>.
Similarly, the speaker information at any current time may be the initial speaker information or the new speaker information. The synthesized speech at any current time may be the final synthesized speech or the initial synthesized speech described below.
Representing the various kinds of information with a multi-element tuple data structure allows a unified data structure to be used for passing information between the different algorithm modules of the speech synthesis system, which helps simplify processing tasks and improve the processing efficiency of the system.
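One way to sketch this unified tuple (field names follow the patent's <Content, Type, Character, Gender, Age, Emotion, SoundID, Audio> notation; the dataclass itself is an illustrative assumption) is an immutable record, with each user correction producing an updated copy:

```python
# Sketch of the unified multi-element data structure. dataclasses.replace
# models the step-by-step updates as the user corrects individual items.

from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class SynthesisRecord:
    content: str
    type: Optional[str] = None       # dialogue / monologue / narration
    character: Optional[str] = None  # role name
    gender: Optional[str] = None     # role attribute
    age: Optional[str] = None        # role attribute
    emotion: Optional[str] = None    # emotion category
    sound_id: Optional[str] = None   # speaker information
    audio: Optional[bytes] = None    # synthesized speech

initial = SynthesisRecord(content='"Great!"', type="dialogue",
                          character="Li Si")
step1 = replace(initial, character="Zhang San")  # user corrects the name
step2 = replace(step1, gender="male")            # user supplements gender
```

Freezing the record keeps each intermediate state intact, which matches the idea that the structure moves between algorithm modules unchanged until explicitly updated.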
Illustratively, text analysis of any target sentence in the text to be processed (step S320) may include: acquiring the context of the target sentence; extracting initial candidate role names based on the context of the target sentence; determining the initial candidate role name as the final candidate role name; and determining the role name corresponding to the target statement through the final candidate role name.
It is often difficult to determine which role utters a statement from the current target sentence alone. Thus, the role uttering the target sentence can be better recognized in conjunction with the context of the target sentence. Several candidate role names may be extracted from the target sentence and its context, and a suitable candidate selected from them as the role name corresponding to the target sentence. The manner of determining character names described above is merely an example; the invention may determine the character name corresponding to the target sentence using any existing or future character recognition method.
Illustratively, determining the role name corresponding to the target statement from the final candidate role name may include: and inputting the candidate character names and/or the text fragments containing the candidate character names, the target sentences and the contexts of the target sentences into the character recognition model to obtain character recognition results output by the character recognition model, wherein the character recognition results are used for indicating the character names corresponding to the target sentences.
Illustratively, the text result information further includes an initial candidate character name, and after determining that the initial candidate character name is a final candidate character name and before determining a character name corresponding to the target sentence by the final candidate character name, the method 300 may further include: modifying the initial candidate character name based on the third modification information to obtain a new candidate character name in the case that the text feedback information includes the third modification information related to the initial candidate character name; and updating the final candidate role name to be the new candidate role name.
Similar to the initial text analysis result and the initial speaker information, the initial candidate character names may also be output and modified by the user as needed. The original final candidate character name may be overlaid with the modified new candidate character name such that the final candidate character name is updated to the new candidate character name. The modification of the initial candidate character name may be understood with reference to the above modification of the initial text analysis result and the initial speaker information, and will not be described here.
Illustratively, extracting the initial candidate role names based on the target sentence and the context of the target sentence includes: extracting the initial candidate role names based on the context of the target sentence using a person-name named entity recognition (NER) model; after modifying the initial candidate role name based on the third modification information, the method 300 may further include: taking the new candidate role name as annotation data and the initial candidate role name as prediction data, calculating a loss function of the person-name NER model; and optimizing the person-name NER model using the calculated loss function.
A person-name named entity recognition (NER) model can recognize the various person names contained in text. Several initial candidate character names can be extracted with the person-name NER model.
Similar to the text analysis model and the speaker adaptation model, the person-name NER model may also be optimized based on the automatic analysis result (i.e., the initial candidate character names) and the manual modification result (i.e., the new candidate character names). The manner of model optimization is as described above and is not repeated here.
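As a toy illustration of the extraction step only: a real system would use a trained person-name NER model, but a lexicon lookup shows the shape of its input (context text) and output (candidate role names). The name list is an assumption for the sketch:

```python
# Toy stand-in for person-name NER: scan the context for known names
# and return them as candidate role names.

KNOWN_NAMES = {"Zhang San", "Li Si", "Xiaohong"}

def extract_candidates(context: str) -> list:
    """Return candidate role names found in the context (sorted for
    determinism; a real NER model would return spans with scores)."""
    found = [name for name in KNOWN_NAMES if name in context]
    return sorted(found)

candidates = extract_candidates('Zhang San smiled. "Go," said Li Si.')
```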
Illustratively, prior to outputting the text result information (step S330), the method 300 may further include: receiving an information output instruction related to a target sentence; wherein the step of outputting text result information is performed in response to receipt of an information output instruction.
If the text to be processed contains a large number of sentences, checking and modifying them one by one is very time- and effort-consuming for the user, so the user can choose whether to view and modify the current sentence as needed. When the user wants to view and modify the current sentence, an information output instruction may be input to the speech synthesis system directly or indirectly (e.g., via a client), and the system outputs the text result information only when the information output instruction is received. This scheme lets the user freely select which sentences to view and modify, makes it convenient for the user to control the depth of manual intervention, and offers a better user experience.
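The per-sentence gating can be sketched as a simple filter: result information is emitted only for sentences for which an information output instruction was received. The `wants_review` callback is an assumption standing in for the user's instruction:

```python
# Sketch of the review gate: only sentences the user asks about produce
# text result information for viewing and modification.

def review_gate(results, wants_review):
    """Return the text result information of sentences the user selected."""
    shown = []
    for idx, result in enumerate(results):
        if wants_review(idx):      # information output instruction received
            shown.append(result)   # output text result information
    return shown

results = ["result-0", "result-1", "result-2"]
shown = review_gate(results, wants_review=lambda i: i != 1)
```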
Illustratively, prior to receiving the information output instruction related to the target sentence, the method 300 may further include: performing speech synthesis based at least on the initial text analysis result to obtain an initial synthesized speech corresponding to the target sentence; and outputting the initial synthesized voice.
After the initial text analysis result is obtained, speech synthesis is first performed automatically, and the resulting initial synthesized speech is output. The user can therefore first check the effect of speech synthesis under the current initial text analysis result (and the initial speaker information), and preliminarily identify which part of the initial text analysis result (and the initial speaker information) is problematic. If the initial text analysis result (and the initial speaker information) is found to be problematic, the user can send an information output instruction to instruct the speech synthesis system to output the text result information for viewing and modification.
Fig. 4 shows a schematic diagram of the processing flow of a multi-role, multi-emotion speech synthesis method according to one embodiment of the present invention. Referring to fig. 4, the processing flow of the multi-role, multi-emotion speech synthesis method is as follows:
(1) Multi-role, multi-emotion text analysis, comprising three aspects:
Text type analysis: distinguishing whether each sentence is dialogue, monologue, or narration, and extracting the dialogue or monologue sentences. This yields data containing the target sentence and the text type analysis result, formally described as: < Content, Type' >;
Multi-role analysis, comprising three steps:
a. Person name NER extraction, i.e., extracting candidate role names;
b. Character recognition, i.e., recognizing the role name corresponding to the target sentence. Data containing the target sentence, the text type analysis result, and the role name analysis result is obtained, formally described as: < Content, Type', Character' >;
c. Character attribute (gender, age, etc.) recognition, automatically extracting the relevant attributes of the character. Data containing the target sentence itself, the text type analysis result, the role name analysis result, and the character attribute analysis result is obtained, formally described as: < Content, Type', Character', Gender', Age' >;
Emotion analysis: predicting the emotion category corresponding to the target sentence from the target sentence and its context, the role name, the character attributes, and other information. Data containing the target sentence itself and the initial text analysis result (which comprises the text type analysis result, the role name analysis result, the character attribute analysis result, and the emotion category analysis result) is obtained (simply referred to as processing result 1), formally described as: < Content, Type', Character', Gender', Age', Emotion' >;
(2) Manual configuration. The problematic places in (1) are manually modified according to the context information to obtain more accurate role, emotion, and other information. Data containing the target sentence itself and the new text analysis result is obtained (simply referred to as processing result 2), formally described as: < Content, Type, Character, Gender, Age, Emotion >;
(3) Speaker (voice library) adaptation, i.e., automatically adapting the speaker information according to the target sentence and its context, the role name, the character attributes, the emotion category, and other information. Data containing the target sentence itself, the new text analysis result, and the initial speaker information is obtained (simply referred to as processing result 3), formally described as: < Content, Type, Character, Gender, Age, Emotion, SoundID' >;
(4) Multi-role, multi-emotion speech synthesis. On the basis of (3), the corresponding speech synthesis model is called to obtain the final synthesized speech. Data containing the target sentence itself, the new text analysis result, the initial speaker information, and the final synthesized speech is obtained (simply referred to as processing result 4), formally described as: < Content, Type, Character, Gender, Age, Emotion, SoundID', Audio >.
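The record that accumulates fields through steps (1)–(4) can be sketched as follows. This is a minimal illustration: the field names mirror the formal description above, while the example sentence and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentenceRecord:
    """One target sentence with the fields accumulated by steps (1)-(4).

    None marks fields not yet produced by the pipeline."""
    content: str                      # the target sentence itself
    type: Optional[str] = None        # dialogue / monologue / narration
    character: Optional[str] = None   # role name
    gender: Optional[str] = None
    age: Optional[str] = None
    emotion: Optional[str] = None
    sound_id: Optional[str] = None    # adapted speaker (voice library) id
    audio: Optional[bytes] = None     # final synthesized speech

# step (1): text analysis fills type/character/gender/age/emotion
rec = SentenceRecord(content='"Let\'s go!" said Tom.')
rec.type, rec.character = "dialogue", "Tom"
rec.gender, rec.age, rec.emotion = "male", "child", "excited"
# step (3): speaker adaptation fills sound_id; step (4) would fill audio
rec.sound_id = "voice_boy_01"
```

Manual configuration (step (2)) would simply overwrite individual fields of such a record before speaker adaptation runs.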
The speech synthesis flow shown in fig. 4 is only an example and not a limitation of the present invention. For example, after speaker adaptation, a manual configuration operation may be performed again (this time modifying the speaker information), in which case new speaker information is obtained and thus a new data structure < Content, Type, Character, Gender, Age, Emotion, SoundID >.
The global configuration scheme of the text analysis method 100 described above can be combined with the manual configuration scheme of the speech synthesis method 300; that is, when text analysis is performed on any target sentence in the text to be processed in combination with the global role information, the resulting text analysis result can be further checked and modified by the user as needed. If a character's dialogue is high-frequency content in the text to be processed and that character is often misrecognized in the initial text analysis results, the manual intervention link would require repeated modification. This would make manual intervention inefficient, increase the (manual) time overhead of the overall speech synthesis, and greatly degrade the user experience. To address this, global character information can be configured before text analysis is performed on local sentences, collecting the character information appearing in the entire text as fully as possible. This can effectively reduce the manual configuration workload. A larger text to be processed (such as a novella or full-length novel) often contains thousands of dialogue sentences; configuring global role information can greatly reduce the amount of data the user needs to modify, thereby improving the efficiency of speech synthesis and the user experience.
Illustratively, before performing text analysis on any target sentence in the text to be processed to obtain the initial text analysis result, the method further includes: performing person name recognition on the text to be processed to determine all person names appearing in the text to be processed; clustering together the person names belonging to the same character among all the person names to obtain at least one person name set corresponding one-to-one to at least one character; and determining global role information based at least on the at least one person name set, the global role information including at least one group of role information corresponding one-to-one to the at least one character, each group of role information including a representative role name and an alias set of the corresponding character, the alias set including the person names in the person name set of the corresponding character other than the representative role name. Performing text analysis on any target sentence in the text to be processed to obtain the initial text analysis result includes: performing text analysis on the target sentence in combination with the global role information to obtain the initial text analysis result.
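The pipeline above (person name recognition, then clustering, then choosing a representative name and alias set per character) can be sketched as follows. The containment-based clustering rule and the example names are hypothetical toy stand-ins for the real clustering step.

```python
from collections import Counter

def cluster_names(names):
    """Group person names that refer to the same character.
    Toy rule: two names belong together if one contains the other."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(name in n or n in name for n in cluster):
                cluster.add(name)
                break
        else:
            clusters.append({name})
    return clusters

def global_role_info(names):
    """For each character: representative name = most frequent name
    in its cluster, the remaining names form the alias set."""
    counts = Counter(names)
    info = []
    for cluster in cluster_names(set(names)):
        rep = max(cluster, key=lambda n: counts[n])
        info.append({"representative": rep, "aliases": cluster - {rep}})
    return info

names = ["Zhang San", "Zhang San", "Zhang San", "San", "Li Si", "Li Si"]
roles = global_role_info(names)
```

A production system would replace the containment rule with the alias clustering actually used over the whole text to be processed.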
Illustratively, each of the at least one set of character information further includes character attribute information of the corresponding character, the character attribute information including character gender information and/or character age information.
Illustratively, the character attribute information includes character gender information, and determining global character information based at least on the at least one person name set includes: for any specific character of the at least one character, determining the gender of the specific character through the gender classification results obtained by one or more of a first operation, a second operation, and a third operation, so as to obtain character gender information of the specific character; wherein the first operation comprises: searching the person name set corresponding to the specific character for gender-indicating person names capable of indicating gender; if at least one gender-indicating person name is found, analyzing the gender of the specific character based on the gender indicated by the at least one gender-indicating person name to obtain a first gender classification result; wherein the second operation comprises: searching the text to be processed for one or more personal pronoun sets matching, in one-to-one correspondence, one or more person names in the person name set corresponding to the specific character, each personal pronoun set including one or more personal pronouns; if at least one personal pronoun set is found, analyzing the gender of the specific character based on the gender indicated by each personal pronoun in the at least one personal pronoun set to obtain a second gender classification result; wherein the third operation comprises: for each person name in the person name set corresponding to the specific character, inputting a text segment containing the person name (i.e., the specific sentence containing the person name and the context of that specific sentence) into a gender classification model to obtain a third gender classification result output by the gender classification model, the third gender classification result indicating the gender of the specific character.
Illustratively, the character gender is classified into at least two genders, wherein analyzing the gender of the specific character based on the gender indicated by the at least one gender-indicating person name to obtain the first gender classification result comprises: determining the gender indicated by each of the at least one gender-indicating person name; for each of the at least two genders, calculating the ratio of the number of gender-indicating person names indicating that gender to the total number of the at least one gender-indicating person name, to obtain a first gender ratio P i1 corresponding to that gender, the first gender classification result comprising the first gender ratios P i1 corresponding one-to-one to the at least two genders, where i = 1, 2, 3 … N and N is the total number of the at least two genders; and/or analyzing the gender of the specific character based on the gender indicated by each personal pronoun in the at least one personal pronoun set to obtain the second gender classification result comprises: determining the gender indicated by each personal pronoun in the at least one personal pronoun set; for each of the at least two genders, calculating the ratio of the number of personal pronouns indicating that gender to the total number of personal pronouns in the at least one personal pronoun set, to obtain a second gender ratio P i2 corresponding to that gender, the second gender classification result comprising the second gender ratios P i2 corresponding one-to-one to the at least two genders; and/or the third gender classification result comprises gender probabilities P i3 corresponding one-to-one to the at least two genders, the gender probability P i3 representing the probability that the gender of the specific character is the corresponding gender.
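The first gender ratio P i1 is a simple frequency ratio over the genders indicated by the found person names; a minimal sketch (the same formula applies to the pronoun-based second gender ratio P i2, with example inputs hypothetical):

```python
def gender_ratios(indicated_genders, genders=("male", "female")):
    """Compute P_i1: for each gender, the fraction of gender-indicating
    person names (or pronouns, for P_i2) that indicate that gender."""
    total = len(indicated_genders)
    return {g: sum(1 for x in indicated_genders if x == g) / total
            for g in genders}

# e.g. names like "Miss Lin", "Madam Wang", "Mr. Chen" each indicate a gender
ratios = gender_ratios(["female", "female", "male"])
```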
Illustratively, determining the gender of the specific character through the gender classification results obtained by one or more of the first, second, and third operations to obtain the character gender information of the specific character includes: determining the gender with the largest first gender ratio P i1 in the first gender classification result as the gender of the specific character, so as to obtain the character gender information; alternatively, determining the gender with the largest second gender ratio P i2 in the second gender classification result as the gender of the specific character, so as to obtain the character gender information; alternatively, determining the gender with the largest gender probability P i3 in the third gender classification result as the gender of the specific character, so as to obtain the character gender information; alternatively, for each of the at least two genders, performing a weighted average or direct addition of two or three of the first gender ratio P i1, the second gender ratio P i2, and the gender probability P i3 corresponding to that gender to obtain a total gender ratio, and determining the gender with the largest total gender ratio as the gender of the specific character, so as to obtain the character gender information.
Illustratively, determining the gender of the specific character through the gender classification results obtained by one or more of the first, second, and third operations to obtain the character gender information of the specific character includes: first performing the first operation, and determining the character gender information based on the first gender classification result if a first condition is satisfied, wherein the first condition includes: at least one gender-indicating person name is found; if the first condition is not satisfied, performing the second operation, and determining the character gender information based on the second gender classification result if a second condition is satisfied, wherein the second condition includes: at least one personal pronoun set is found; and if the second condition is not satisfied, performing the third operation and determining the character gender information based on the third gender classification result.
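The fall-through order of the three operations can be sketched as follows. The callable interfaces and thresholds are hypothetical; the optional "largest ratio must exceed a threshold" conditions described next are folded in for illustration.

```python
def classify_gender(find_indicative_names, find_pronoun_sets, run_model,
                    threshold1=0.6, threshold2=0.6):
    """Cascade: try the name-based first operation, fall through to the
    pronoun-based second operation, then to the model-based third one.
    The first two return {gender: ratio} dicts (empty if nothing found);
    the model returns a gender directly."""
    result1 = find_indicative_names()          # first operation
    if result1 and max(result1.values()) > threshold1:
        return max(result1, key=result1.get)
    result2 = find_pronoun_sets()              # second operation
    if result2 and max(result2.values()) > threshold2:
        return max(result2, key=result2.get)
    return run_model()                         # third operation

# no gender-indicating names found, but pronouns are decisive:
g = classify_gender(lambda: {},
                    lambda: {"female": 0.9, "male": 0.1},
                    lambda: "male")
```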
Illustratively, the first condition further comprises: in the first gender classification result, the largest first gender ratio P i1 is greater than a first ratio threshold; and/or the second condition further comprises: in the second gender classification result, the largest second gender ratio P i2 is greater than a second ratio threshold.
Illustratively, searching the text to be processed for one or more personal pronoun sets matching, in one-to-one correspondence, one or more person names in the person name set corresponding to the specific character comprises: for each person name in the person name set corresponding to the specific character, acquiring at least one specific sentence containing the person name and the respective context of the at least one specific sentence; and for each of the at least one specific sentence, finding, from the specific sentence and its context, the personal pronoun closest to the person name as a personal pronoun matching the person name.
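Finding the personal pronoun nearest to a person name can be sketched with toy whitespace tokenization. The pronoun list and the tokenization are simplifications for illustration; the real system would operate on Chinese text with proper word segmentation.

```python
def nearest_pronoun(text, person_name, pronouns=("he", "she", "it")):
    """Return the personal pronoun whose token position in the text is
    closest to any occurrence of the given person name."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    name_pos = [i for i, t in enumerate(tokens)
                if t == person_name.lower()]
    if not name_pos:
        return None
    best, best_dist = None, None
    for i, tok in enumerate(tokens):
        if tok in pronouns:
            d = min(abs(i - p) for p in name_pos)
            if best_dist is None or d < best_dist:
                best, best_dist = tok, d
    return best

p = nearest_pronoun("Mary smiled. She waved to Tom as he left.", "Mary")
```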
Illustratively, the character attribute information includes character age information, and determining global character information based at least on the at least one person name set includes: for any specific character of the at least one character, determining the age of the specific character through the age prediction results obtained by a fourth operation and/or a fifth operation, to obtain character age information of the specific character; wherein the fourth operation comprises: searching the person name set corresponding to the specific character for age-indicating person names capable of indicating age; if at least one age-indicating person name is found, analyzing the age of the specific character based on the age indicated by the at least one age-indicating person name to obtain a first age prediction result; wherein the fifth operation comprises: for each person name in the person name set corresponding to the specific character, inputting a text segment containing the person name (i.e., the specific sentence containing the person name and the context of that specific sentence) into an age prediction model to obtain a second age prediction result output by the age prediction model, the second age prediction result indicating the age of the specific character.
Illustratively, the character age is divided into at least two age categories, wherein analyzing the age of the specific character based on the age indicated by the at least one age-indicating person name to obtain the first age prediction result comprises: determining the age category indicated by each of the at least one age-indicating person name; for each of the at least two age categories, calculating the ratio of the number of age-indicating person names indicating that age category to the total number of the at least one age-indicating person name, to obtain an age ratio P i4 corresponding to that age category, the first age prediction result comprising the age ratios P i4 corresponding one-to-one to the at least two age categories, where i = 1, 2, 3 … M and M is the total number of the at least two age categories; and/or the second age prediction result comprises age probabilities P i5 corresponding one-to-one to the at least two age categories, the age probability P i5 representing the probability that the age of the specific character is the corresponding age category.
Illustratively, determining the age of the specific character through the age prediction results obtained by the fourth operation and/or the fifth operation to obtain the character age information of the specific character includes: determining the age category with the largest age ratio P i4 in the first age prediction result as the age of the specific character, to obtain the character age information; alternatively, determining the age category with the largest age probability P i5 in the second age prediction result as the age of the specific character, to obtain the character age information; alternatively, for each of the at least two age categories, performing a weighted average or direct addition of the age ratio P i4 and the age probability P i5 corresponding to that age category to obtain a total age ratio, and determining the age category with the largest total age ratio as the age of the specific character, to obtain the character age information.
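Combining the two age prediction results by weighted average and picking the largest total ratio can be sketched as follows (weights and category names are hypothetical):

```python
def decide_age(p4=None, p5=None, w4=0.5, w5=0.5):
    """Weighted average of the first (P_i4) and second (P_i5) age
    prediction results; either result may be absent. Returns the age
    category with the largest total ratio."""
    categories = set(p4 or {}) | set(p5 or {})
    total = {c: w4 * (p4 or {}).get(c, 0.0) + w5 * (p5 or {}).get(c, 0.0)
             for c in categories}
    return max(total, key=total.get)

age = decide_age(p4={"child": 0.8, "adult": 0.2},
                 p5={"child": 0.6, "adult": 0.3, "elder": 0.1})
```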
Illustratively, determining the age of the specific character through the age prediction results obtained by the fourth operation and/or the fifth operation to obtain the character age information of the specific character includes: first performing the fourth operation, and determining the character age information based on the first age prediction result if a third condition is satisfied, wherein the third condition includes: at least one age-indicating person name is found; and if the third condition is not satisfied, performing the fifth operation and determining the character age information based on the second age prediction result.
Illustratively, the third condition further comprises: in the first age prediction result, the largest age ratio P i4 is greater than a third ratio threshold.
Illustratively, determining global role information based at least on the at least one person name set includes: for each character of the at least one character, analyzing the person name features of each person name in the person name set corresponding to the character, the person name features including one or more of the following: the number of occurrences, whether the person name contains a surname, and whether the person name contains common given-name characters; and selecting, from the person name set corresponding to the character, a person name whose features meet preset requirements as the representative role name of the character.
Illustratively, the preset requirements include: having the largest number of occurrences in the person name set corresponding to the character, or having the largest number of occurrences among the person names that contain a surname and/or common given-name characters.
Illustratively, before performing text analysis on any target sentence in the text to be processed to obtain the initial text analysis result, the method further includes: outputting the global role information; receiving fourth modification information, related to the global role information, input by a user; and modifying the global role information based on the fourth modification information.
Illustratively, performing text analysis on the target sentence in combination with the global role information to obtain the initial text analysis result includes: analyzing the text type of the target sentence to obtain an analysis result corresponding to the text type, the text analysis result including the analysis result corresponding to the text type; in the case that the target sentence belongs to the multi-role type, identifying an initial role name corresponding to the target sentence based at least on the target sentence; retrieving, from the global role information, target character information containing the initial role name, the target character information corresponding to a target character; and for any specific preset item of the at least one preset item, extracting information corresponding to the specific preset item from the target character information and determining the extracted information as the analysis result corresponding to the specific preset item in the initial text analysis result, wherein, in the case that the specific preset item is the role name, the extracted information is the representative role name in the target character information.
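Retrieving the target character information whose representative name or alias set contains the initial role name can be sketched as follows (the field layout of each role entry is hypothetical):

```python
def find_target_role(initial_name, global_info):
    """Return the role entry whose representative name or alias set
    contains the initially recognized role name, else None."""
    for role in global_info:
        if (initial_name == role["representative"]
                or initial_name in role["aliases"]):
            return role
    return None

global_info = [{"representative": "Zhang San", "aliases": {"San"},
                "gender": "male", "age": "adult"}]
# an alias in the sentence resolves to the representative role name
role = find_target_role("San", global_info)
```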
Illustratively, before extracting, for any specific preset item of the at least one preset item, information corresponding to the specific preset item from the target character information and determining the extracted information as the analysis result corresponding to the specific preset item in the initial text analysis result, the method further includes: in the case that the target sentence belongs to the multi-role type, analyzing the character attribute corresponding to the target sentence to obtain an analysis result corresponding to the character attribute and the confidence of that analysis result, the initial text analysis result including the analysis result corresponding to the character attribute; and if the confidence of the analysis result corresponding to the character attribute is lower than a preset confidence threshold, determining the character attribute to be a specific preset item. Extracting, for any specific preset item of the at least one preset item, information corresponding to the specific preset item from the target character information and determining the extracted information as the analysis result corresponding to the specific preset item in the initial text analysis result comprises: extracting the character attribute information from the target character information; and overwriting the analysis result corresponding to the character attribute in the initial text analysis result with the extracted character attribute information.
Illustratively, the character attribute is classified into at least two types of attributes, and analyzing the character attribute corresponding to the target sentence to obtain the analysis result corresponding to the character attribute and the confidence of that analysis result includes: acquiring a text to be analyzed, the text to be analyzed including the target sentence and the context of the target sentence; acquiring the person name set corresponding to the target character from the global role information; identifying, based on the person name set corresponding to the target character, all text segments relating to the target character from the text to be analyzed; performing attribute recognition on each of the text segments to determine the attributes corresponding one-to-one to all the text segments; for each of the at least two types of attributes, calculating the ratio of the number of text segments corresponding to that attribute to the total number of text segments, to obtain an attribute ratio; selecting the attribute with the largest attribute ratio as the character attribute corresponding to the target sentence, to obtain the analysis result corresponding to the character attribute; and determining the largest attribute ratio as the confidence of the analysis result corresponding to the character attribute.
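The attribute ratio and its use as a confidence value can be sketched as a majority vote over the per-segment attribute guesses (the example inputs are hypothetical):

```python
from collections import Counter

def attribute_with_confidence(fragment_attrs):
    """Majority-vote attribute over all text segments of the character;
    the winning attribute's share of segments is its confidence."""
    counts = Counter(fragment_attrs)
    attr, n = counts.most_common(1)[0]
    return attr, n / len(fragment_attrs)

# e.g. per-segment gender guesses for one character
attr, conf = attribute_with_confidence(["male", "male", "male", "female"])
```

A confidence below the preset threshold would trigger the overwrite with the attribute stored in the global role information, as described above.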
Illustratively, analyzing the character attribute corresponding to the target sentence to obtain the analysis result corresponding to the character attribute and the confidence of that analysis result includes: acquiring a text to be analyzed, the text to be analyzed including the target sentence and the context of the target sentence; and inputting the text to be analyzed into the attribute recognition model to obtain the analysis result corresponding to the character attribute, and the confidence of that analysis result, output by the attribute recognition model.
Illustratively, identifying the initial role name corresponding to the target statement based at least on the target statement includes: acquiring a text to be analyzed, wherein the text to be analyzed comprises a target sentence and a context of the target sentence; extracting initial candidate role names based on the text to be analyzed; determining the initial candidate role name as the final candidate role name; and inputting the global character information, the final candidate character names and the text to be analyzed into a character recognition model to obtain a character recognition result output by the character recognition model, wherein the character recognition result is used for indicating the character name corresponding to the target sentence as an initial character name.
Illustratively, in the case that the target sentence belongs to the multi-role type, performing text analysis on the target sentence in combination with the global role information after identifying an initial role name corresponding to the target sentence based at least on the target sentence, so as to obtain an initial text analysis result further includes: acquiring a text to be analyzed, wherein the text to be analyzed comprises a target sentence and a context of the target sentence; and inputting the global character information, the initial character name and the text to be analyzed into an attribute identification model to obtain an analysis result which is output by the attribute identification model and corresponds to the character attribute.
Illustratively, clustering together the person names belonging to the same character among all the person names to obtain the at least one person name set corresponding one-to-one to the at least one character includes: clustering together the person names belonging to the same character among all the person names to obtain person name sets corresponding one-to-one to all the characters in the text to be processed; and selecting, from all the characters, the characters whose corresponding person name sets contain a number of person names greater than a preset number threshold, as the at least one character.
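Filtering out incidental characters by the preset number threshold can be sketched as follows (the threshold value and example clusters are hypothetical):

```python
def main_roles(name_clusters, min_names=2):
    """Keep only characters whose person name set exceeds the preset
    number threshold, dropping characters mentioned only in passing."""
    return [c for c in name_clusters if len(c) > min_names]

clusters = [{"Zhang San", "San", "Old Zhang"},  # recurring character
            {"Passerby A"}]                      # incidental character
kept = main_roles(clusters)
```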
Fig. 5 shows a schematic diagram of text analysis in connection with global role information in a speech synthesis process according to one embodiment of the invention. The above-described embodiment can be understood with reference to fig. 5.
According to another aspect of the present invention, there is provided a text analysis device. Fig. 6 shows a schematic block diagram of a text analysis device 600 according to an embodiment of the invention.
As shown in fig. 6, the text analysis device 600 according to an embodiment of the present invention includes an acquisition module 610, a person name recognition module 620, a clustering module 630, a global determination module 640, and a text analysis module 650. The various modules may perform the various steps/functions of the text analysis method 100 described above in connection with fig. 1-2, respectively. Only the main functions of the respective components of the text analysis device 600 will be described below, and the details already described above will be omitted.
The obtaining module 610 is configured to obtain text to be processed.
The person name recognition module 620 is configured to perform person name recognition on the text to be processed to determine all person names appearing in the text to be processed.
The clustering module 630 is configured to cluster names belonging to the same role in all the names together to obtain at least one name set corresponding to at least one role one by one.
The global determining module 640 is configured to determine global role information based at least on at least one person name set, where the global role information includes at least one set of role information corresponding to at least one role, each set of role information includes a representative role name of the corresponding role and an alias set, and the alias set includes person names except the representative role name in the person name set of the corresponding role.
The text analysis module 650 is configured to perform text analysis on any target sentence in the text to be processed in combination with the global role information, so as to obtain a text analysis result corresponding to the target sentence, where the text analysis includes analysis on at least one preset item, and the at least one preset item includes one or more of the following: the method comprises the steps of text type, character name and character attribute, wherein the analysis of the text type refers to judging whether a target sentence belongs to a multi-role type, the multi-role type comprises dialogs, and the character attribute comprises character gender and/or character age.
According to another aspect of the present invention, a text analysis system is provided. Fig. 7 shows a schematic block diagram of a text analysis system 700 according to one embodiment of the invention. Text analysis system 700 includes a processor 710 and a memory 720.
The memory 720 stores computer program instructions for implementing the respective steps in the text analysis method 100 according to an embodiment of the present invention.
The processor 710 is configured to execute computer program instructions stored in the memory 720 to perform the corresponding steps of the text analysis method 100 according to an embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium having stored thereon program instructions for performing the respective steps of the text analysis method 100 of the embodiment of the present invention when the program instructions are executed by a computer or a processor, and for realizing the respective modules in the text analysis device 600 according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media.
According to another aspect of the present invention, there is provided a speech synthesis method (represented by a first speech synthesis method) including the above text analysis method 100, wherein the speech synthesis method further includes: and performing voice synthesis on the target sentence at least based on the text analysis result so as to obtain a synthesized voice corresponding to the target sentence.
The speech synthesis may be performed directly based on the text analysis result obtained in step S150 of the text analysis method 100 described above without additional modification of the text analysis result.
According to another aspect of the present invention, a speech synthesis apparatus (represented by a first speech synthesis apparatus) is provided. The speech synthesis apparatus according to the embodiment of the present invention includes the above-described text analysis apparatus 600 and a speech synthesis module. The voice synthesis module is used for performing voice synthesis on the target sentence at least based on the text analysis result so as to obtain a synthesized voice corresponding to the target sentence. The respective modules may perform the respective steps/functions of the first speech synthesis method described above, respectively.
According to another aspect of the present invention, a speech synthesis system is provided. The speech synthesis system includes a processor and a memory. The memory stores computer program instructions for implementing the respective steps in the above-described first speech synthesis method according to an embodiment of the present invention. The processor is configured to execute computer program instructions stored in the memory to perform the respective steps of the first speech synthesis method according to an embodiment of the invention.
According to another aspect of the present invention there is provided a storage medium having stored thereon program instructions which, when executed by a computer or processor, are adapted to carry out the respective steps of the first speech synthesis method of an embodiment of the present invention and to carry out the respective modules in the first speech synthesis apparatus according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media.
According to another aspect of the present invention, there is provided a speech synthesis method (represented by a second speech synthesis method) including the above text analysis method 100, wherein the speech synthesis method further includes: outputting text result information, wherein the text result information includes an initial text analysis result obtained by the step of performing text analysis on any target sentence in the text to be processed in combination with the global role information; receiving text feedback information input by a user; in the case that the text feedback information includes first modification information related to the initial text analysis result, modifying the initial text analysis result based on the first modification information to obtain a new text analysis result; and performing speech synthesis on the target sentence based at least on a final text analysis result to obtain a final synthesized speech corresponding to the target sentence, wherein the final text analysis result is the initial text analysis result if the initial text analysis result is not modified, and is the new text analysis result if the initial text analysis result is modified.
The text analysis result obtained in step S150 of the above-described text analysis method 100 may serve as the initial text analysis result, which is then corrected manually, in a manner such as that described for the speech synthesis method 300, before speech synthesis is performed.
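The feedback step of this second speech synthesis method can be sketched as follows. The function name `apply_feedback` and the dictionary-based representation of the analysis result and the first modification information are illustrative assumptions for exposition, not the patented implementation:

```python
from typing import Optional

def apply_feedback(initial_result: dict, feedback: Optional[dict]) -> dict:
    """Return the final text analysis result given optional user feedback.

    If the text feedback information contains first modification information
    (modeled here as a 'modifications' mapping of fields to new values), the
    initial result is updated to produce a new result; otherwise the initial
    result is used unchanged as the final result.
    """
    if feedback and feedback.get("modifications"):
        final_result = dict(initial_result)  # copy so the initial result is preserved
        final_result.update(feedback["modifications"])
        return final_result
    return initial_result
```

Speech synthesis would then be performed on the target sentence using the dictionary returned by this function.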
According to another aspect of the present invention, a speech synthesis apparatus (represented by a second speech synthesis apparatus) is provided. The speech synthesis apparatus according to the embodiment of the present invention includes the above-described text analysis apparatus 600, and an output module, a receiving module, a modifying module, and a speech synthesis module. The respective modules may perform the respective steps/functions of the second speech synthesis method described above, respectively.
The output module is used for outputting text result information, wherein the text result information comprises an initial text analysis result obtained through the step of carrying out text analysis on any target sentence in the text to be processed by combining the global role information.
The receiving module is used for receiving text feedback information input by a user.
The modification module is used for modifying the initial text analysis result based on the first modification information to obtain a new text analysis result when the text feedback information comprises the first modification information related to the initial text analysis result.
The voice synthesis module is used for performing voice synthesis on the target sentence at least based on the final text analysis result so as to obtain final synthesized voice corresponding to the target sentence, wherein the final text analysis result is an initial text analysis result under the condition that the initial text analysis result is not modified, and the final text analysis result is a new text analysis result under the condition that the initial text analysis result is modified.
According to another aspect of the present invention, a speech synthesis system is provided. The speech synthesis system includes a processor and a memory. The memory stores computer program instructions for implementing the respective steps in the above-described second speech synthesis method according to an embodiment of the present invention. The processor is configured to execute computer program instructions stored in the memory to perform the corresponding steps of the second speech synthesis method according to an embodiment of the present invention.
According to another aspect of the present invention there is provided a storage medium having stored thereon program instructions which, when executed by a computer or processor, are adapted to carry out the respective steps of the second speech synthesis method of an embodiment of the present invention and to carry out the respective modules in the second speech synthesis apparatus according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted or not performed.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. However, the method of the present invention should not be construed as reflecting the following intent: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some of the modules in a text analysis system and/or a speech synthesis system according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention, and the scope of the present invention is not limited thereto; variations or substitutions that any person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present invention. The protection scope of the invention is therefore subject to the protection scope of the claims.

Claims (30)

1. A text analysis method, comprising:
acquiring a text to be processed;
performing name recognition on the text to be processed to determine all names appearing in the text to be processed;
clustering the names belonging to the same role in all the names together to obtain at least one name set corresponding to at least one role one by one;
determining global role information at least based on the at least one personal name set, wherein the global role information comprises at least one group of role information corresponding to the at least one role one by one, each group of role information comprises a representative role name of the corresponding role and an alias set, and the alias set comprises personal names except the representative role name in the personal name set of the corresponding role;
performing text analysis on any target sentence in the text to be processed in combination with the global role information to obtain a text analysis result corresponding to the target sentence, wherein the text analysis comprises analysis on at least one preset item, and the at least one preset item comprises one or more of the following items: the method comprises the steps of analyzing a text type, a character name and a character attribute, wherein the text type is used for judging whether the target sentence belongs to a multi-role type or not, the multi-role type comprises dialogs, and the character attribute comprises character gender and/or character age;
The text analysis of any target sentence in the text to be processed by combining the global role information comprises the following steps:
analyzing the text type of the target sentence to obtain an analysis result corresponding to the text type, wherein the text analysis result comprises the analysis result corresponding to the text type;
identifying an initial role name corresponding to the target sentence at least based on the target sentence under the condition that the target sentence belongs to the multi-role type;
retrieving target character information containing the initial character name from the global character information, wherein the target character information corresponds to a target character;
extracting information corresponding to a specific preset item from the target character information for any specific preset item of the at least one preset item, and determining the extracted information as an analysis result corresponding to the specific preset item in the text analysis result, wherein in the case that the specific preset item is a character name, the extracted information is a representative character name in the target character information;
wherein before extracting information corresponding to the specific preset item from the target character information for any specific preset item of the at least one preset item and determining that the extracted information is an analysis result corresponding to the specific preset item in the text analysis result, the method further includes:
Analyzing character attributes corresponding to the target sentence under the condition that the target sentence belongs to the multi-role type to obtain analysis results corresponding to the character attributes and confidence degrees of the analysis results corresponding to the character attributes, wherein the text analysis results comprise the analysis results corresponding to the character attributes;
if the confidence coefficient of the analysis result corresponding to the character attribute is lower than a preset confidence coefficient threshold value, determining the character attribute as the specific preset item;
the extracting information corresponding to the specific preset item from the target character information for any specific preset item in the at least one preset item, and determining that the extracted information is an analysis result corresponding to the specific preset item in the text analysis results includes:
extracting character attribute information from the target character information;
and covering the analysis results corresponding to the character attribute in the text analysis results by using the extracted character attribute information.
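The per-sentence analysis of claim 1 can be illustrated with the following minimal sketch: a multi-role sentence's recognized initial role name is looked up in the global role information, and the representative role name of the matched role is used in the analysis result. The data structures and the `analyze_sentence` name are assumptions for illustration; the claim does not prescribe a concrete representation:

```python
def analyze_sentence(is_multi_role: bool, initial_name: str, global_info: list) -> dict:
    """global_info: list of {"representative": str, "aliases": [str, ...]} entries,
    one entry per role, as built from the person-name sets."""
    if not is_multi_role:
        return {"type": "narration"}
    for role in global_info:
        # Match the initial role name against the representative name or any alias.
        if initial_name == role["representative"] or initial_name in role["aliases"]:
            return {"type": "dialogue", "role_name": role["representative"]}
    # No match in the global role information: keep the recognized name as-is.
    return {"type": "dialogue", "role_name": initial_name}
```

For example, a dialogue sentence attributed to the alias "Tommy" would be resolved to the representative role name "Tom" if the global role information groups the two names under one role.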
2. The method of claim 1, wherein each set of character information in the at least one set of character information further comprises character attribute information of a corresponding character, the character attribute information comprising character gender information and/or character age information.
3. The method of claim 2, wherein the character attribute information comprises character gender information, and the determining global character information based at least on the at least one set of personal names comprises:
for any specific character of the at least one character, determining the gender of the specific character through the gender classification result obtained by one or more of the first operation, the second operation and the third operation respectively so as to obtain the character gender information of the specific character;
wherein the first operation comprises:
searching a gender-indicating person name capable of indicating gender from a person name set corresponding to the specific role;
if at least one person is found to indicate the name of the person, analyzing the gender of the specific role based on the gender indicated by the at least one person to obtain a first gender classification result;
wherein the second operation comprises:
searching, from the text to be processed, for one or more personal pronoun sets that match, in one-to-one correspondence, any one or more personal names in the personal name set corresponding to the specific role, wherein each personal pronoun set comprises one or more personal pronouns;
analyzing the gender of the specific role based on the gender indicated by each personal pronoun in the at least one personal pronoun set if the at least one personal pronoun set is found, so as to obtain a second gender classification result;
Wherein the third operation includes:
and inputting a text segment containing the name, a specific sentence containing the name and the context of the specific sentence into a gender classification model for each name in the name set corresponding to the specific character to obtain a third gender classification result output by the gender classification model, wherein the third gender classification result is used for indicating the gender of the specific character.
4. The method of claim 3, wherein the character sexes are divided into at least two sexes, wherein,
the analyzing the gender of the specific character based on the gender indicated by the at least one persona indication name to obtain a first gender classification result includes:
determining the gender indicated by each of the at least one person indicative of a person name;
for each of the at least two sexes, calculating a ratio of the number of gender-indicative personal names belonging to that gender to the total number of the at least one gender-indicative personal name, to obtain a first gender ratio P i1 corresponding to that gender, wherein the first gender classification result comprises the first gender ratios P i1 in one-to-one correspondence with the at least two sexes, where i = 1, 2, …, N and N is the total number of the at least two sexes; and/or,
the analyzing the gender of the particular persona based on the gender indicated by each of the at least one set of personal pronouns to obtain a second gender classification result includes:
determining a gender indicated by each personal pronoun in the at least one set of personal pronouns;
for each of the at least two sexes, calculating a ratio of the number of personal pronouns belonging to that gender to the total number of all personal pronouns in the at least one set of personal pronouns, to obtain a second gender ratio P i2 corresponding to that gender, wherein the second gender classification result comprises the second gender ratios P i2 in one-to-one correspondence with the at least two sexes; and/or,
the third gender classification result comprises gender probabilities P i3 in one-to-one correspondence with the at least two sexes, the gender probability P i3 representing the probability that the sex of the specific character is the corresponding sex.
5. The method of claim 4, wherein the determining the gender of the specific character through the gender classification result obtained by each of one or more of the first operation, the second operation, and the third operation to obtain the character gender information of the specific character comprises:
determining, based on the first gender classification result, the gender with the maximum first gender ratio P i1 as the gender of the specific character, so as to obtain the character gender information; or,
determining, based on the second gender classification result, the gender with the maximum second gender ratio P i2 as the gender of the specific character, so as to obtain the character gender information; or,
determining, based on the third gender classification result, the gender with the maximum gender probability P i3 as the gender of the specific character, so as to obtain the character gender information; or,
for each of the at least two sexes, performing a weighted average or direct addition of two or three of the first gender ratio P i1, the second gender ratio P i2 and the gender probability P i3 corresponding to that gender to obtain a total gender ratio, and determining the gender with the maximum total gender ratio as the gender of the specific character, so as to obtain the character gender information.
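The last alternative of claim 5 — combining the first gender ratio P i1, second gender ratio P i2 and gender probability P i3 and taking the gender with the largest total — can be sketched as follows. The equal weights and dictionary inputs are illustrative assumptions, not values fixed by the claim:

```python
def combine_gender_scores(p1: dict, p2: dict, p3: dict,
                          weights=(1 / 3, 1 / 3, 1 / 3)) -> str:
    """p1, p2, p3 map each gender to its ratio/probability (P_i1, P_i2, P_i3).
    Returns the gender with the maximum weighted-average total."""
    totals = {}
    for gender in p1:
        totals[gender] = (weights[0] * p1[gender]
                          + weights[1] * p2[gender]
                          + weights[2] * p3[gender])
    return max(totals, key=totals.get)  # gender with the largest total gender ratio
```

Direct addition, the claim's other option, corresponds to setting every weight to 1.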
6. The method of claim 3, wherein the determining the gender of the specific character through the gender classification result obtained by each of one or more of the first operation, the second operation, and the third operation to obtain the character gender information of the specific character comprises:
first, performing the first operation, and determining the character gender information based on the first gender classification result if a first condition is satisfied, wherein the first condition comprises: the at least one gender-indicating personal name being found;
performing the second operation if the first condition is not satisfied, and determining the character gender information based on the second gender classification result if a second condition is satisfied, wherein the second condition comprises: the at least one personal pronoun set being found;
and performing the third operation and determining the character gender information based on the third gender classification result if the second condition is not satisfied.
7. The method of claim 4 or 5, wherein the determining the gender of the specific character through the gender classification result obtained by each of one or more of the first operation, the second operation, and the third operation to obtain the character gender information of the specific character comprises:
first, performing the first operation, and determining the character gender information based on the first gender classification result if a first condition is satisfied, wherein the first condition comprises: the at least one gender-indicating personal name being found;
performing the second operation if the first condition is not satisfied, and determining the character gender information based on the second gender classification result if a second condition is satisfied, wherein the second condition comprises: the at least one personal pronoun set being found;
and performing the third operation and determining the character gender information based on the third gender classification result if the second condition is not satisfied.
8. The method of claim 7, wherein,
the first condition further includes: in the first gender classification result, the maximum first gender ratio P i1 being greater than a first proportional threshold; and/or
the second condition further includes: in the second gender classification result, the maximum second gender ratio P i2 being greater than a second proportional threshold.
9. The method according to any one of claims 3 to 5, wherein the searching, from the text to be processed, for one or more personal pronoun sets that match, in one-to-one correspondence, any one or more personal names in the personal name set corresponding to the specific character comprises:
for each person name in the set of person names corresponding to the particular persona,
acquiring the context of at least one specific sentence containing the name of the person;
For each particular sentence in the at least one particular sentence, searching for a human pronoun closest to the name from the particular sentence and the context of the particular sentence as a human pronoun matching the name.
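The nearest-pronoun rule of claim 9 can be sketched as follows. The whitespace tokenization and the small English pronoun inventory are simplifying assumptions; the claim itself is language-agnostic and does not fix a tokenizer:

```python
from typing import Optional

PRONOUNS = {"he", "she", "him", "her", "his", "hers"}  # toy pronoun inventory

def nearest_pronoun(tokens: list, name: str) -> Optional[str]:
    """Return the personal pronoun closest (by token distance) to the person
    name within the sentence-plus-context token sequence, or None."""
    name_positions = [i for i, t in enumerate(tokens) if t == name]
    pron_positions = [(i, t) for i, t in enumerate(tokens) if t.lower() in PRONOUNS]
    if not name_positions or not pron_positions:
        return None
    n = name_positions[0]
    return min(pron_positions, key=lambda p: abs(p[0] - n))[1]
```

In practice the token sequence would be the specific sentence concatenated with its context, so pronouns in neighboring sentences can also be matched to the name.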
10. The method of claim 2, wherein the persona attribute information includes persona age information, and the determining global persona information based at least on the at least one set of personal names includes:
for any specific character in the at least one character, determining the age of the specific character through the age prediction result obtained by the fourth operation and/or the fifth operation so as to obtain the character age information of the specific character;
wherein the fourth operation includes:
searching an age indication personal name capable of indicating the age from a personal name set corresponding to the specific role;
analyzing the age of the specific role based on the age indicated by the at least one age-indicated person name if the at least one age-indicated person name is found, to obtain a first age prediction result;
wherein the fifth operation includes:
and inputting a text segment containing the name, a specific sentence containing the name and the context of the specific sentence into an age prediction model for each name in the name set corresponding to the specific character to obtain a second age prediction result output by the age prediction model, wherein the second age prediction result is used for indicating the age of the specific character.
11. The method of claim 10, wherein the persona ages are divided into at least two age categories, wherein,
the analyzing the age of the specific character based on the age indicated by the at least one age-indicated person name to obtain a first age prediction result includes:
determining an age category indicated by each of the at least one age-indicated person name;
for each of the at least two age categories, calculating a ratio of the number of age-indicated personal names belonging to that age category to the total number of the at least one age-indicated personal name, to obtain an age ratio P i4 corresponding to that age category, wherein the first age prediction result comprises the age ratios P i4 in one-to-one correspondence with the at least two age categories, where i = 1, 2, …, M and M is the total number of the at least two age categories; and/or,
the second age prediction result comprises age probabilities P i5 in one-to-one correspondence with the at least two age categories, the age probability P i5 representing the probability that the age of the specific character is the corresponding age category.
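The age ratio P i4 of claim 11 — each age category's share of the age-indicated personal names — can be computed as in this small sketch; the function name and the category labels are illustrative assumptions:

```python
def age_proportions(indicated_categories: list) -> dict:
    """indicated_categories: one age category per age-indicated person name
    (e.g. 'child' for a name like 'Little Ming', 'elder' for 'Grandpa Wang').
    Returns the age ratio P_i4 for each category present."""
    if not indicated_categories:
        return {}
    total = len(indicated_categories)
    counts = {}
    for category in indicated_categories:
        counts[category] = counts.get(category, 0) + 1
    return {c: n / total for c, n in counts.items()}
```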
12. The method of claim 11, wherein the determining the age of the specific character through the age prediction result obtained by each of the fourth operation and/or the fifth operation to obtain the character age information of the specific character comprises:
determining, based on the first age prediction result, the age category with the maximum age ratio P i4 as the age of the specific character, so as to obtain the character age information; or,
determining, based on the second age prediction result, the age category with the maximum age probability P i5 as the age of the specific character, so as to obtain the character age information; or,
for each age category in the at least two age categories, performing a weighted average or direct addition of the age ratio P i4 and the age probability P i5 corresponding to that age category to obtain a total age ratio, and determining the age category with the maximum total age ratio as the age of the specific character, so as to obtain the character age information.
13. The method of claim 10, wherein the determining the age of the specific character through the age prediction result obtained by each of the fourth operation and/or the fifth operation to obtain the character age information of the specific character comprises:
first, performing the fourth operation, and determining the character age information based on the first age prediction result if a third condition is satisfied, wherein the third condition comprises: the at least one age-indicated personal name being found;
and performing the fifth operation and determining the character age information based on the second age prediction result if the third condition is not satisfied.
14. The method of claim 11 or 12, wherein the determining the age of the specific character through the age prediction result obtained by each of the fourth operation and/or the fifth operation to obtain character age information of the specific character comprises:
first, performing the fourth operation, and determining the character age information based on the first age prediction result if a third condition is satisfied, wherein the third condition comprises: the at least one age-indicated personal name being found;
and performing the fifth operation and determining the character age information based on the second age prediction result if the third condition is not satisfied.
15. The method of claim 14, wherein,
the third condition further includes: in the first age prediction result, the maximum age ratio P i4 being greater than a third proportional threshold.
16. The method of any of claims 1 to 5, wherein the determining global role information based at least on the at least one set of personal names comprises:
For each of the at least one character,
analyzing the name characteristics of each personal name in the personal name set corresponding to the character, wherein the name characteristics comprise one or more of the following: the number of occurrences, whether a surname is contained, and whether a common given-name character is contained;
and selecting the name with the name characteristics meeting the preset requirements from the name set corresponding to the character as the representative character name of the character.
17. The method of claim 16, wherein the preset requirements include: having the largest number of occurrences in the personal name set corresponding to the character, or having the largest number of occurrences among personal names containing a surname and/or common given-name characters.
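The preset requirement of claim 17 can be sketched as follows: prefer the most frequent personal name among those containing a surname, falling back to the most frequent name overall. The toy surname inventory and the `pick_representative` name are illustrative assumptions:

```python
from collections import Counter

SURNAMES = {"Wang", "Li", "Zhang"}  # toy surname inventory, an assumption

def pick_representative(names: list) -> str:
    """Pick the representative character name from a role's person-name list."""
    counts = Counter(names)
    # Candidate pool: names containing a surname, if any exist.
    with_surname = [n for n in counts if any(n.startswith(s) for s in SURNAMES)]
    pool = with_surname or list(counts)
    # Within the pool, choose the name with the most occurrences.
    return max(pool, key=lambda n: counts[n])
```

The names not selected would then form the alias set of the corresponding role in the global role information.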
18. The method of any of claims 1 to 5, wherein prior to said text analysis of any target sentence in the text to be processed in conjunction with the global role information, the method further comprises:
outputting the global role information;
receiving modification information related to the global role information input by a user;
and modifying the global role information based on the modification information.
19. The method of claim 1, wherein the role attributes are divided into at least two types of attributes, and the analyzing the role attributes corresponding to the target sentence to obtain the analysis result corresponding to the role attributes and the confidence of the analysis result corresponding to the role attributes comprises:
Acquiring a text to be analyzed, wherein the text to be analyzed comprises the target sentence and the context of the target sentence;
acquiring a person name set corresponding to the target role from the global role information;
identifying all text fragments of the target role from the text to be analyzed based on a name set corresponding to the target role;
performing attribute identification on each text segment in all the text segments to determine attributes corresponding to all the text segments one by one;
for each type of attribute in the at least two types of attributes, calculating the ratio of the number of text fragments corresponding to that type of attribute to the total number of all the text fragments, to obtain an attribute proportion;
selecting the attribute with the largest attribute proportion as the role attribute corresponding to the target sentence, so as to obtain the analysis result corresponding to the role attribute;
and determining the attribute proportion of the attribute with the largest attribute proportion as the confidence coefficient of the analysis result corresponding to the character attribute.
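The confidence rule of claim 19 can be sketched directly: the character attribute is the one with the largest share among the attributes recognized for the character's text fragments, and that share serves as the confidence. The function and attribute labels below are illustrative assumptions:

```python
def attribute_with_confidence(fragment_attributes: list) -> tuple:
    """fragment_attributes: one recognized attribute per text fragment of the
    target character. Returns (analysis result, confidence)."""
    total = len(fragment_attributes)
    counts = {}
    for attr in fragment_attributes:
        counts[attr] = counts.get(attr, 0) + 1
    best = max(counts, key=counts.get)  # attribute with the largest proportion
    return best, counts[best] / total   # its proportion is the confidence
```

When this confidence falls below the preset confidence threshold of claim 1, the attribute information stored in the global role information overrides the per-sentence analysis result.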
20. The method of claim 1, wherein analyzing the role attribute corresponding to the target sentence to obtain the analysis result corresponding to the role attribute and the confidence of that analysis result comprises:
acquiring a text to be analyzed, wherein the text to be analyzed comprises the target sentence and the context of the target sentence;
and inputting the text to be analyzed into an attribute identification model to obtain the analysis result corresponding to the role attribute and the confidence of that analysis result, as output by the attribute identification model.
21. The method of claim 1, wherein the identifying, based at least on the target sentence, an initial role name to which the target sentence corresponds comprises:
acquiring a text to be analyzed, wherein the text to be analyzed comprises the target sentence and the context of the target sentence;
extracting initial candidate role names based on the text to be analyzed;
determining the initial candidate role names as final candidate role names;
and inputting the global role information, the final candidate role names and the text to be analyzed into a role recognition model to obtain a role recognition result output by the role recognition model, wherein the role recognition result indicates the role name corresponding to the target sentence as the initial role name.
22. The method of claim 1, wherein, in the case that the target sentence belongs to the multi-role type, the text analysis of any target sentence in the text to be processed in combination with the global role information further comprises, after identifying the initial role name corresponding to the target sentence based at least on the target sentence:
acquiring a text to be analyzed, wherein the text to be analyzed comprises the target sentence and the context of the target sentence;
and inputting the global role information, the initial role name and the text to be analyzed into an attribute identification model to obtain the analysis result corresponding to the role attribute output by the attribute identification model.
23. The method of any of claims 1 to 5, wherein the clustering of person names belonging to the same role among all the person names, to obtain at least one person name set in one-to-one correspondence with at least one role, comprises:
clustering the person names belonging to the same role among all the person names to obtain person name sets in one-to-one correspondence with all the roles in the text to be processed;
and selecting, from all the roles, those roles whose corresponding person name sets contain more than a preset threshold number of person names, as the at least one role.
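The filtering step of claim 23 can be sketched as below. This is an illustrative Python fragment under stated assumptions: the alias clustering itself (e.g. matching "Xiao Zhang" to "Zhang San") is assumed to have already produced the per-role name sets, and the example role names are hypothetical.

```python
def select_main_roles(name_sets, min_names=1):
    """Keep only roles whose clustered person-name set contains more
    than `min_names` names, i.e. roles mentioned under several names."""
    return {role: names
            for role, names in name_sets.items()
            if len(names) > min_names}

clusters = {
    "Zhang San": {"Zhang San", "Xiao Zhang", "Old Zhang"},
    "passerby": {"passerby"},
}
main = select_main_roles(clusters, min_names=1)
# only "Zhang San" survives the threshold
```

Thresholding on set size prunes incidental one-off mentions, so the global role information only tracks recurring speakers.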
24. A speech synthesis method comprising the text analysis method of any one of claims 1 to 23, wherein the speech synthesis method further comprises:
outputting text result information, wherein the text result information comprises an initial text analysis result obtained by performing the text analysis on any target sentence in the text to be processed in combination with the global role information;
receiving text feedback information input by a user;
in the case that the text feedback information includes first modification information related to the initial text analysis result, modifying the initial text analysis result based on the first modification information to obtain a new text analysis result; and
performing speech synthesis on the target sentence based at least on a final text analysis result to obtain a final synthesized speech corresponding to the target sentence, wherein the final text analysis result is the initial text analysis result if the initial text analysis result has not been modified, and is the new text analysis result if it has been modified.
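The user-feedback step of claim 24 reduces to an override-then-synthesize flow. The sketch below is a hypothetical Python illustration: the dictionary keys and example values are assumptions for demonstration, not fields defined by the patent.

```python
def finalize_analysis(initial_result, feedback=None):
    """Apply user modification information, if any, on top of the
    initial text analysis result; otherwise keep the initial result."""
    if feedback:
        final = dict(initial_result)   # do not mutate the original
        final.update(feedback)         # first modification information
        return final
    return initial_result

initial = {"role": "Zhang San", "gender": "male", "age": "youth"}
final = finalize_analysis(initial, feedback={"age": "child"})
# final["age"] == "child"; unmodified fields carry over unchanged
```

Synthesis then consumes whichever result `finalize_analysis` returns, so unreviewed sentences pass through with their automatic analysis intact.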
25. A text analysis device, comprising:
an acquisition module, configured to acquire a text to be processed;
a person name recognition module, configured to perform person name recognition on the text to be processed to determine all person names appearing in the text to be processed;
a clustering module, configured to cluster the person names belonging to the same role among all the person names to obtain at least one person name set in one-to-one correspondence with at least one role;
a global determining module, configured to determine global role information based at least on the at least one person name set, where the global role information includes at least one set of role information in one-to-one correspondence with the at least one role, each set of role information includes a representative role name of the corresponding role and an alias set, and the alias set includes the person names in that role's person name set other than the representative role name;
a text analysis module, configured to perform text analysis on any target sentence in the text to be processed in combination with the global role information to obtain a text analysis result corresponding to the target sentence, wherein the text analysis comprises analysis of at least one preset item, and the at least one preset item comprises one or more of: a text type, a role name and a role attribute, wherein the text type is used to judge whether the target sentence belongs to a multi-role type, the multi-role type comprises dialogue, and the role attribute comprises role gender and/or role age;
wherein, the text analysis module includes:
an analysis sub-module, configured to analyze the text type of the target sentence to obtain an analysis result corresponding to the text type, where the text analysis result includes the analysis result corresponding to the text type;
an identification sub-module, configured to identify, in the case that the target sentence belongs to the multi-role type, an initial role name corresponding to the target sentence based at least on the target sentence;
a retrieval sub-module, configured to retrieve, from the global role information, target role information including the initial role name, where the target role information corresponds to a target role;
an extraction sub-module, configured to extract, for any specific preset item of the at least one preset item, information corresponding to the specific preset item from the target role information, and to determine the extracted information as the analysis result corresponding to the specific preset item in the text analysis result, where, in the case that the specific preset item is the role name, the extracted information is the representative role name in the target role information;
wherein, the text analysis device further includes:
an attribute analysis module, configured to analyze, in the case that the target sentence belongs to the multi-role type and before the extraction sub-module extracts the information corresponding to the specific preset item, the role attribute corresponding to the target sentence to obtain the analysis result corresponding to the role attribute and the confidence of that analysis result, wherein the text analysis result comprises the analysis result corresponding to the role attribute;
a determining module, configured to determine the role attribute as the specific preset item if the confidence of the analysis result corresponding to the role attribute is lower than a preset confidence threshold;
wherein the extraction sub-module comprises:
an extracting unit, configured to extract role attribute information from the target role information;
and an overriding unit, configured to override the analysis result corresponding to the role attribute in the text analysis result with the extracted role attribute information.
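The low-confidence fallback implemented by the attribute analysis module, determining module, and overriding unit can be illustrated as follows. This Python sketch is an assumption-laden illustration: the 0.8 threshold is an arbitrary example value, not one specified by the patent.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative value, not from the patent

def resolve_attribute(sentence_result, sentence_confidence, global_attribute):
    """Prefer the per-sentence attribute analysis; if its confidence is
    below the threshold, override it with the attribute stored in the
    global role information."""
    if sentence_confidence < CONFIDENCE_THRESHOLD:
        return global_attribute
    return sentence_result

# sentence-level classifier was unsure, so the global role profile wins
low_conf = resolve_attribute("female", 0.55, "male")
# confident sentence-level result is kept as-is
high_conf = resolve_attribute("female", 0.9, "male")
```

The design keeps sentence-local evidence when it is strong, while the document-wide role profile acts as a prior for ambiguous sentences.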
26. A speech synthesis apparatus comprising the text analysis apparatus of claim 25, wherein the speech synthesis apparatus further comprises:
an output module, configured to output text result information, wherein the text result information comprises an initial text analysis result obtained by performing the text analysis on any target sentence in the text to be processed in combination with the global role information;
a receiving module, configured to receive text feedback information input by a user;
a modification module, configured to modify, in the case that the text feedback information comprises first modification information related to the initial text analysis result, the initial text analysis result based on the first modification information to obtain a new text analysis result; and
a speech synthesis module, configured to perform speech synthesis on the target sentence based at least on a final text analysis result to obtain a final synthesized speech corresponding to the target sentence, wherein the final text analysis result is the initial text analysis result if the initial text analysis result has not been modified, and is the new text analysis result if it has been modified.
27. A text analysis system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the text analysis method of any of claims 1 to 23.
28. A speech synthesis system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the speech synthesis method of claim 24.
29. A storage medium having stored thereon program instructions for performing the text analysis method of any of claims 1 to 23 when run.
30. A storage medium having stored thereon program instructions for performing, when executed, the speech synthesis method of claim 24.
CN202110787732.XA 2021-07-13 2021-07-13 Text analysis and speech synthesis method, device, system and storage medium Active CN113539235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110787732.XA CN113539235B (en) 2021-07-13 2021-07-13 Text analysis and speech synthesis method, device, system and storage medium


Publications (2)

Publication Number Publication Date
CN113539235A CN113539235A (en) 2021-10-22
CN113539235B true CN113539235B (en) 2024-02-13

Family

ID=78098729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110787732.XA Active CN113539235B (en) 2021-07-13 2021-07-13 Text analysis and speech synthesis method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN113539235B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863906B (en) * 2022-07-07 2022-10-28 北京中电慧声科技有限公司 Method and device for marking alias of text-to-speech processing

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001331191A (en) * 2000-05-23 2001-11-30 Sharp Corp Device and method for voice synthesis, portable terminal and program recording medium
US9639518B1 (en) * 2011-09-23 2017-05-02 Amazon Technologies, Inc. Identifying entities in a digital work
CN108091321A (en) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 A kind of phoneme synthesizing method
CN109523986A (en) * 2018-12-20 2019-03-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and storage medium
WO2020018724A1 (en) * 2018-07-19 2020-01-23 Dolby International Ab Method and system for creating object-based audio content
CN110767209A (en) * 2019-10-31 2020-02-07 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN112434492A (en) * 2020-10-23 2021-03-02 北京百度网讯科技有限公司 Text labeling method and device and electronic equipment
CN112908292A (en) * 2019-11-19 2021-06-04 北京字节跳动网络技术有限公司 Text voice synthesis method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8594996B2 (en) * 2007-10-17 2013-11-26 Evri Inc. NLP-based entity recognition and disambiguation
US8355919B2 (en) * 2008-09-29 2013-01-15 Apple Inc. Systems and methods for text normalization for text to speech synthesis
CN110491365A (en) * 2018-05-10 2019-11-22 微软技术许可有限责任公司 Audio is generated for plain text document


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Guo, Jiunn-Liang et al.; "A feature selection approach for automatic e-book classification based on discourse segmentation"; Program: Electronic Library and Information Systems; Vol. 49, No. 1; pp. 2-22 *
Li Xiaohong; "Improvements to Text Processing Techniques for Speech Synthesis"; China Master's Theses Full-text Database, Information Science and Technology; pp. 1-67 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1201, Building B, Phase 1, Innovation Park, No. 1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province, 266101

Applicant after: Beibei (Qingdao) Technology Co.,Ltd.

Address before: 100192 b303a, floor 3, building B-2, Zhongguancun Dongsheng science and Technology Park, No. 66, xixiaokou Road, Haidian District, Beijing

Applicant before: DATABAKER (BEIJNG) TECHNOLOGY Co.,Ltd.

GR01 Patent grant