CN113409766A - Recognition method, device for recognition and voice synthesis method - Google Patents

Recognition method, device for recognition and voice synthesis method

Info

Publication number
CN113409766A
CN113409766A (application CN202110605363.8A)
Authority
CN
China
Prior art keywords
text
dialog text
speaker
candidate
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110605363.8A
Other languages
Chinese (zh)
Inventor
林国雯
周明
程龙
姜伟
曾可璇
段文君
刘恺
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202110605363.8A priority Critical patent/CN113409766A/en
Publication of CN113409766A publication Critical patent/CN113409766A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a recognition method, a device for recognition, and a speech synthesis method. The recognition method includes the following steps: recognizing dialog texts in a target text; determining candidate speakers of the current dialog text according to the context of the current dialog text; acquiring the relationship characteristics between the candidate speakers and the current dialog text; and determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics. Embodiments of the invention can automatically recognize the target speaker of each dialog text in a target text, reduce labor cost, improve recognition efficiency, and improve the accuracy of target speaker recognition.

Description

Recognition method, device for recognition and voice synthesis method
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a recognition method, a recognition device, and a speech synthesis method.
Background
Audio versions of literary works are receiving more and more attention. For a literary work with multiple roles, the role to which each dialog belongs needs to be distinguished, so that the performers of the different roles can quickly and accurately record their own lines.
At present, however, the role to which each dialog in a literary work belongs is usually identified by manually reading through the work, which not only consumes a large amount of labor cost but is also inefficient.
Disclosure of Invention
Embodiments of the present invention provide an identification method, an identification device, and a speech synthesis method, which can automatically identify a target speaker of each dialog text in a target text, reduce labor cost, and improve identification efficiency.
In order to solve the above problem, an embodiment of the present invention discloses an identification method, where the method includes:
recognizing dialog text in the target text;
determining candidate speakers of the current dialog text according to the context of the current dialog text;
acquiring the relation characteristics between the candidate speaker and the current dialog text;
and determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relationship characteristics.
Optionally, the determining a candidate speaker of the current dialog text according to the context of the current dialog text includes:
inputting the context of the current dialog text sentence by sentence into a recognition model, and recognizing the references in the context;
and using the recognized references as candidate speakers of the current dialog text.
Optionally, the method further comprises:
identifying whether the references in the target text correspond to the same entity;
and performing coreference resolution on the references corresponding to the same entity to obtain all the dialog texts of the same role.
Optionally, the method further comprises:
acquiring a target dialog text in the target text and a target speaker of the target dialog text;
and carrying out voice synthesis on the target dialog text according to the role characteristics of the target speaker of the target dialog text and the dialog scene characteristics of the target dialog text to obtain voice synthesis data of the target dialog text.
Optionally, the determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text, and the relationship characteristic includes:
inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text and the relationship characteristics into a prediction model to predict the score of each candidate speaker as a target speaker;
at least one target speaker of the current dialog text is determined from the candidate speakers according to the predicted score of each candidate speaker.
Optionally, the inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristic into a prediction model to predict the score of each candidate speaker as the target speaker includes:
acquiring input data corresponding to each candidate speaker, wherein the input data corresponding to the current candidate speaker comprises: the current dialog text, the context of the current dialog text, the current candidate speaker of the current dialog text, and the relationship characteristics between the current candidate speaker and the current dialog text;
and sequentially inputting the input data corresponding to each candidate speaker into the prediction model, and respectively predicting the score of each candidate speaker as the target speaker.
Optionally, the inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristic into a prediction model to predict the score of each candidate speaker as the target speaker includes:
combining the candidate speakers of the current dialog text pairwise to obtain candidate speaker combinations;
acquiring input data corresponding to each candidate speaker combination, wherein the input data corresponding to the current candidate speaker combination comprises: the current dialog text, the context of the current dialog text, the current candidate speaker combination of the current dialog text, and the relationship characteristics between each candidate speaker in the current candidate speaker combination and the current dialog text;
and sequentially inputting the input data corresponding to each candidate speaker combination into a prediction model, and respectively predicting the score of each candidate speaker in each candidate speaker combination as the target speaker.
Optionally, the inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristic into a prediction model to predict the score of each candidate speaker as the target speaker includes:
and inputting the current dialog text, the context of the current dialog text, all candidate speakers of the current dialog text and the relationship characteristics between each candidate speaker and the current dialog text into a prediction model together, and predicting the score of each candidate speaker as the target speaker.
Optionally, before determining the candidate speaker of the current dialog text according to the context of the current dialog text, the method further includes:
sampling the target text to obtain a sampled text;
determining sampling references whose occurrence counts in the sampled text meet a preset condition;
the determining the candidate speaker of the current dialog text according to the context of the current dialog text comprises:
the speaker that appears in the sample designation and in the context of the current dialog text is determined to be a candidate for the current dialog text.
Optionally, the relationship characteristics between the candidate speaker and the current dialog text include any one or more of the following: the distance between the candidate speaker and the current dialog text, whether the candidate speaker and the current dialog text cross a paragraph boundary, and the number of times the candidate speaker appears in the context of the current dialog text.
Optionally, the recognizing dialog text in the target text includes:
identifying any one or more of the following texts in the target text: dialog text, narration text, monologue text, and inner monologue text.
On the other hand, the embodiment of the invention discloses a voice synthesis method, which comprises the following steps:
determining at least one target speaker for each dialog text in the target text by using the recognition method of any one of the preceding claims;
and synthesizing voice data of the corresponding dialog text according to at least one target speaker of each dialog text in the target text.
In another aspect, an embodiment of the present invention discloses an identification apparatus, where the apparatus includes:
the dialogue identification module is used for identifying dialogue texts in the target texts;
the candidate determining module is used for determining candidate speakers of the current dialog text according to the context of the current dialog text;
the characteristic acquisition module is used for acquiring the relation characteristic between the candidate speaker and the current conversation text;
and the target determining module is used for determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relationship characteristics.
Optionally, the candidate determining module includes:
the model recognition submodule is used for inputting the context of the current dialog text sentence by sentence into a recognition model and recognizing the references in the context;
and the candidate determining submodule is used for taking the recognized references as candidate speakers of the current dialog text.
Optionally, the apparatus further comprises:
the entity identification module is used for identifying whether all the references in the target text correspond to the same entity or not;
and the coreference resolution module is used for coreference resolution of the names corresponding to the same entity to obtain all the dialog texts of the same role.
Optionally, the apparatus further comprises:
the target acquisition module is used for acquiring a target dialog text in the target text and the target speaker of the target dialog text;
and the voice synthesis module is used for carrying out voice synthesis on the target dialogue text according to the role characteristics of the target speaker of the target dialogue text and the dialogue scene characteristics of the target dialogue text to obtain voice synthesis data of the target dialogue text.
Optionally, the goal determination module includes:
a score prediction sub-module, configured to input the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics into a prediction model, and predict a score of each candidate speaker as a target speaker;
and the target determining submodule is used for determining at least one target speaker of the current dialog text from the candidate speakers according to the predicted score of each candidate speaker.
Optionally, the score predictor sub-module comprises:
the first obtaining unit is used for obtaining input data corresponding to each candidate speaker, wherein the input data corresponding to the current candidate speaker comprises: the current dialog text, the context of the current dialog text, the current candidate speaker of the current dialog text, and the relationship characteristics between the current candidate speaker and the current dialog text;
and the first prediction unit is used for sequentially inputting the input data corresponding to each candidate speaker into the prediction model and respectively predicting the score of each candidate speaker as the target speaker.
Optionally, the score predictor sub-module comprises:
the candidate combination unit is used for combining the candidate speakers of the current dialog text pairwise to obtain candidate speaker combinations;
a second obtaining unit, configured to obtain input data corresponding to each candidate speaker combination, where the input data corresponding to the current candidate speaker combination includes: the current dialog text, the context of the current dialog text, the current candidate speaker combination of the current dialog text, and the relationship characteristics between each candidate speaker in the current candidate speaker combination and the current dialog text;
and the second prediction unit is used for sequentially inputting the input data corresponding to each candidate speaker combination into the prediction model and respectively predicting the score of each candidate speaker in each candidate speaker combination as the target speaker.
Optionally, the score predictor sub-module comprises:
and the third prediction unit is used for inputting the current dialog text, the context of the current dialog text, all candidate speakers of the current dialog text and the relationship characteristics between each candidate speaker and the current dialog text into a prediction model together, and predicting the score of each candidate speaker as the target speaker.
Optionally, the apparatus further comprises:
the text sampling module is used for sampling the target text to obtain a sampled text;
the sampling selection module is used for determining sampling references whose occurrence counts in the sampled text meet a preset condition;
the candidate determining module is specifically configured to determine a reference that appears both among the sampling references and in the context of the current dialog text as a candidate speaker of the current dialog text.
Optionally, the relationship characteristics between the candidate speaker and the current dialog text include any one or more of the following: the distance between the candidate speaker and the current dialog text, whether the candidate speaker and the current dialog text cross a paragraph boundary, and the number of times the candidate speaker appears in the context of the current dialog text.
Optionally, the dialog recognition module is specifically configured to recognize any one or more of the following texts in the target text: dialog text, narration text, monologue text, and inner monologue text.
In another aspect, an embodiment of the present invention discloses a speech synthesis apparatus, where the apparatus includes:
a matching module for determining at least one target speaker of each dialog text in the target text by using the recognition method of any one of the preceding claims;
and the synthesis module is used for synthesizing the voice data of the corresponding dialog text according to at least one target speaker of each dialog text in the target text.
In yet another aspect, an apparatus for recognition is disclosed, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the recognition method of any of the above claims.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform an identification method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
the recognition method provided by the embodiment of the invention can be used for automatically recognizing the target speaker corresponding to each dialog text in the target text. Specifically, a dialog text in a target text is firstly identified, and a candidate speaker of the current dialog text is determined according to the context of the current dialog text. And then, acquiring the relation characteristics between the candidate speaker and the current dialog text, and determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relation characteristics. The relation characteristics between the candidate speaker and the current dialogue text can reflect the relevance between the candidate speaker and the current dialogue text, and further can be used as effective parameters for determining the target speaker of the current dialogue text, so that the accuracy of determining the target speaker is improved. The embodiment of the invention combines the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relation characteristic multifaceted factors between the candidate speaker and the current dialog text to comprehensively determine at least one target speaker of the current dialog text, thereby improving the accuracy of identifying the target speaker. And can reduce the labor cost and improve the recognition efficiency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of the steps of an embodiment of an identification method of the present invention;
FIG. 2 is a flow chart of the steps of one embodiment of a speech synthesis method of the present invention;
FIG. 3 is a block diagram of an embodiment of an identification device of the present invention;
FIG. 4 is a block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus 800 for identification of the present invention;
fig. 6 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of an identification method of the present invention is shown, which may specifically include the following steps:
step 101, recognizing a dialog text in a target text;
step 102, determining candidate speakers of a current dialog text according to the context of the current dialog text;
step 103, acquiring the relationship characteristics between the candidate speakers and the current dialog text;
step 104, determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics.
The recognition method provided by the invention can be applied to electronic devices, including but not limited to: servers, smart phones, voice recorders, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-car computers, desktop computers, set-top boxes, smart TVs, wearable devices, and the like.
The recognition method provided by the invention can be used for automatically recognizing the target speaker corresponding to each dialog text in the target text. The target text may be electronic text, such as a novel e-book, a script, or the like.
The embodiment of the invention first recognizes the dialog texts in the target text. Further, recognizing the dialog texts in the target text includes: identifying any one or more of the following texts in the target text: dialog text, narration text, monologue text, and inner monologue text.
Dialog text refers to text describing a conversation between at least two characters. Narration text refers to descriptive text that supplements and explains the plot. Monologue text refers to text in which a character speaks to himself or herself. Inner monologue text refers to text describing a character's mental activities.
In specific implementations, any available method for recognizing dialog text in existing schemes may be employed. Illustratively, the dialog texts in the target text can be identified by text processing such as sentence segmentation, sentence structure recognition, and sentence semantic analysis.
In one embodiment of the invention, the target text is processed to identify the quotation marks it contains and to judge whether the quoted content includes sentence-ending punctuation; quoted text containing sentence-ending punctuation is determined to be dialog text.
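As a concrete illustration, the following is a minimal sketch of this quote-based heuristic. The quote styles, the punctuation set, and the sample sentence are assumptions for illustration, not the patent's exact rules:

```python
import re

SENT_END = "。！？!?."  # assumed set of sentence-ending punctuation

def extract_dialogs(paragraph: str):
    """Return quoted spans that look like spoken dialog."""
    dialogs = []
    # Match full-width Chinese quotes “…” as well as ASCII double quotes.
    for match in re.finditer(r'“([^”]+)”|"([^"]+)"', paragraph):
        quoted = match.group(1) or match.group(2)
        # Only quotes containing sentence-ending punctuation count as
        # dialog, filtering out short quoted terms such as nicknames.
        if any(ch in SENT_END for ch in quoted):
            dialogs.append(quoted)
    return dialogs

print(extract_dialogs('韩子道：“你在哪个区？”他笑了。'))  # ['你在哪个区？']
```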
In practical applications, the speaker of a dialog text usually appears close to the dialog text, i.e., in the context of the dialog text. Therefore, for a recognized dialog text (the current dialog text), the embodiment of the invention determines its candidate speakers according to its context, acquires the relationship characteristics between the candidate speakers and the current dialog text, and finally determines at least one target speaker of the current dialog text according to the current dialog text, its context, its candidate speakers, and the relationship characteristics. The next dialog text is then taken as the current dialog text and processed in the same way to determine its target speaker.
The context of the current dialog text may include the N sentences before it (the text above) and the M sentences after it (the text below). The specific values of M and N may be set according to experimental statistics or experience. In one example, the 5 sentences before and the 5 sentences after the current dialog text are selected as its context. In implementations, other dialog texts may be included in the context of the current dialog text.
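A minimal sketch of such a context window over a pre-split sentence list (the function name and defaults are illustrative assumptions):

```python
def dialog_context(sentences, dialog_idx, n_before=5, m_after=5):
    """Return the N sentences above and the M sentences below the
    current dialog text; N = M = 5 follows the example above."""
    above = sentences[max(0, dialog_idx - n_before):dialog_idx]
    below = sentences[dialog_idx + 1:dialog_idx + 1 + m_after]
    return above, below
```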
In an optional embodiment of the present invention, the relationship characteristics between a candidate speaker and the current dialog text may include any one or more of the following: the distance between the candidate speaker and the current dialog text, whether the candidate speaker and the current dialog text cross a paragraph boundary, and the number of times the candidate speaker appears in the context of the current dialog text.
The relation characteristic between the candidate speaker and the current dialogue text can reflect the relevance between the candidate speaker and the current dialogue text, and further the relation characteristic can be used as an effective parameter for determining a target speaker of the current dialogue text, so that the accuracy of determining the target speaker is improved.
The distance between the candidate speaker and the current dialog text refers to the distance between the position of the candidate speaker in the context and the current dialog text; a smaller distance means a stronger association between the two. The embodiment of the invention does not limit how this distance is expressed. In one example, it may be represented by the number of characters between the candidate speaker and the current dialog text, counted in Chinese characters, words, or other character units.
Whether the candidate speaker and the current dialog text cross a paragraph boundary refers to whether they belong to the same paragraph; if they do not, they cross a paragraph boundary. If no paragraph boundary is crossed, or fewer boundaries are crossed, the association between the candidate speaker and the current dialog text is stronger.
The greater the number of occurrences of the candidate speaker in the context of the current dialog text, the greater the association between the candidate speaker and the current dialog text.
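A minimal sketch of computing these three characteristics from character offsets (the signature and the offset convention are assumptions):

```python
def relation_features(text: str, cand_pos: int, cand: str,
                      dialog_start: int, dialog_end: int, context: str):
    """Compute the three relationship characteristics described above."""
    # Distance: number of characters between the mention and the dialog.
    if cand_pos < dialog_start:
        distance = dialog_start - (cand_pos + len(cand))
    else:
        distance = cand_pos - dialog_end
    # Paragraph crossing: a newline between the two spans means the
    # candidate and the dialog text lie in different paragraphs.
    lo, hi = sorted((cand_pos, dialog_start))
    crosses_paragraph = "\n" in text[lo:hi]
    # Frequency: occurrences of the candidate in the dialog's context.
    occurrences = context.count(cand)
    return {"distance": distance,
            "crosses_paragraph": crosses_paragraph,
            "occurrences": occurrences}
```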
Finally, at least one target speaker of the current dialog text is determined comprehensively according to the current dialog text, its context, its candidate speakers, and the relationship characteristics between the candidate speakers and the current dialog text, which both ensures the accuracy of target speaker recognition and improves recognition efficiency. In addition, because the embodiment of the invention selects the target speaker from among the candidate speakers of the current dialog text, the situation where a dialog text has no target speaker is avoided, and so are the abnormal situations an empty target speaker would cause in subsequent processing.
In an optional embodiment of the present invention, the determining the candidate speaker of the current dialog text according to the context of the current dialog text comprises:
step S11, inputting the context of the current dialog text sentence by sentence into a recognition model, and recognizing the references in the context;
step S12, using the recognized references as candidate speakers of the current dialog text.
The recognition model can be a neural network model trained in advance and used to recognize all references in a text. A reference is the name by which a speaker is called in the text, and may be a surname (e.g., Zhang), a given name or nickname (e.g., Xiaohong), a pronoun (e.g., he, she, they), a noun phrase (e.g., Mr. Zhang, Lailao, Bangbo, Sir Liu), and so on.
The embodiment of the invention inputs the context of the current dialog text into the recognition model sentence by sentence, automatically recognizes all references in the context, and takes all recognized references as candidate speakers of the current dialog text.
The embodiment of the invention does not limit the model structure or the training method of the recognition model. In one example, the recognition model may be a pre-trained language model, such as a BERT model or an Electra model. The model structure of the recognition model can be a BERT model or an Electra model plus a fully connected layer.
Pre-training refers to training performed before a model is trained with task-specific sample data. Its aim is to train, in advance, the lower and middle layers that downstream tasks have in common; each downstream task then trains its own model with its own sample data, which greatly speeds up convergence. A pre-trained BERT or Electra model can be fine-tuned (the fine-tuning stage) when subsequently applied to a specific NLP (Natural Language Processing) task, and can be adapted to a variety of different NLP tasks.
The training data for the recognition model can be text data with labeled references; the trained recognition model is obtained by fine-tuning the pre-trained BERT or Electra model with this training data.
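The patent fixes no architecture beyond "pre-trained encoder plus a fully connected layer", so the following is only a sketch of that structure as a token classifier (requires torch and transformers; the checkpoint name and the three-label tagging scheme are assumptions):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class MentionRecognizer(nn.Module):
    """Pre-trained BERT encoder plus a fully connected layer."""

    def __init__(self, num_labels=3):  # e.g. B-REF / I-REF / O tags
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # per-token label logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = MentionRecognizer()
batch = tokenizer("冬至哭笑不得", return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (1, sequence_length, num_labels)
```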
In an optional embodiment of the present invention, the determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics includes:
step S21, inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text and the relationship characteristic into a prediction model, and predicting the score of each candidate speaker as the target speaker;
step S22, determining at least one target speaker of the current dialog text from the candidate speakers according to the predicted score of each candidate speaker.
The prediction model may be a pre-trained neural network model, and may be used to predict the score of the candidate speaker as the target speaker. The higher the score, the higher the probability that the candidate speaker is the target speaker of the current dialog text.
The embodiment of the invention does not limit the model structure or the training method of the prediction model. In one example, the prediction model may be a pre-trained language model such as a BERT model or an Electra model. The model structure of the prediction model can be a BERT model or an Electra model plus a fully connected layer.
The training data for the prediction model may be text data containing labeled dialog texts and the labeled speakers to which they correspond, and includes both positive and negative examples. Take <quote, person> as the form of a training record, where quote is a dialog text and person is a reference appearing in the context of quote. Then <dialog text, real speaker> is a positive example, and <dialog text, another role that appears in the context> is a negative example.
After the pre-trained BERT or Electra model is fine-tuned with this training data, the trained prediction model is obtained. Its output is the predicted score, e.g., 0.999 (1 being the highest and 0 the lowest).
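A minimal sketch of assembling such positive and negative records (the record layout is an assumption):

```python
def build_training_pairs(dialog, context, true_speaker, context_references):
    """<dialog, real speaker> is positive; <dialog, any other reference
    appearing in the context> is negative."""
    examples = [{"quote": dialog, "context": context,
                 "person": true_speaker, "label": 1}]
    for ref in context_references:
        if ref != true_speaker:
            examples.append({"quote": dialog, "context": context,
                             "person": ref, "label": 0})
    return examples
```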
In example one, the target text contains the following partial text:
Winter Solstice, not standing on ceremony, picked up the phone, started changing equipment and summoning beasts, and then ran dungeons with the team; the movements were practiced and fluent, clearly those of a seasoned homebody with no nightlife.
Hanzi asked with concern: "Which region are you in, big shot? Can I hold your thigh?"
Winter Solstice replied, not knowing whether to laugh or cry: "We are in the same region; add me as a friend."
Chatting back and forth, the two got to know each other, and Winter Solstice learned that Hanzi was called He Feng and had only gone to Changchun on a business trip.
First, through text analysis, the quotation marks in the target text are identified and it is judged whether the quoted content contains sentence-ending punctuation, so as to recognize the dialog texts inside the quotation marks and acquire their contexts.
In example one, the following two dialog texts are recognized: "Which region are you in, big shot? Can I hold your thigh?" and "We are in the same region; add me as a friend."
For the first dialog text (i.e., the current dialog text) "Which region are you in, big shot? Can I hold your thigh?", the acquired text above it includes "Winter Solstice, not standing on ceremony, picked up the phone, started changing equipment and summoning beasts, and then ran dungeons with the team; the movements were practiced and fluent, clearly those of a seasoned homebody with no nightlife." and "Hanzi asked with concern:". The acquired text below it includes "Winter Solstice replied, not knowing whether to laugh or cry:", "We are in the same region; add me as a friend.", and "Chatting back and forth, the two got to know each other, and Winter Solstice learned that Hanzi was called He Feng and had only gone to Changchun on a business trip.". In this example, the context of the first dialog text includes another dialog text, "We are in the same region; add me as a friend."
Then, the acquired sentences above and below the first dialog text are input into the recognition model, and the following three references are recognized: "Winter Solstice", "Hanzi", and "He Feng". These three references are taken as the candidate speakers.
Next, the relationship characteristics between each candidate speaker and the first dialog text are acquired. Taking the candidate speaker "Hanzi" as an example, its relationship characteristics with the first dialog text are: the distance between "Hanzi" and the first dialog text is 5, "Hanzi" and the first dialog text do not cross a paragraph boundary, and "Hanzi" appears twice in the context of the first dialog text. Likewise, the relationship characteristics between the candidate speakers "Winter Solstice" and "He Feng", respectively, and the first dialog text may be acquired.
Finally, the first dialog text, its context, its candidate speakers, and the relationship characteristics between the candidate speakers and the first dialog text are input into the prediction model, and the score of each candidate speaker as the target speaker is predicted; the target speaker of the first dialog text is then determined from the candidate speakers according to the predicted scores, e.g., the candidate speaker with the highest score is determined to be the target speaker.
The second dialog text is processed in the same way to obtain its target speaker.
For step S21, the present invention provides three alternative implementations as follows.
Optionally, the inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristic into a prediction model to predict the score of each candidate speaker as the target speaker includes:
step A1, obtaining input data corresponding to each candidate speaker, wherein the input data corresponding to the current candidate speaker includes: the current dialog text, the context of the current dialog text, the current candidate speaker of the current dialog text, and the relationship characteristics between the current candidate speaker and the current dialog text;
and step A2, sequentially inputting the input data corresponding to each candidate speaker into the prediction model, and respectively predicting the score of each candidate speaker as the target speaker.
In the first alternative, the prediction model scores the multiple candidate speakers of the current dialog text separately; that is, each invocation of the prediction model takes the input data corresponding to a single candidate speaker.
Assuming the current dialog text has three candidate speakers A, B, and C, the score of candidate speaker A as the target speaker can be obtained by inputting the current dialog text, its context, the current candidate speaker (candidate speaker A), and the relationship characteristics between candidate speaker A and the current dialog text into the prediction model.
The scores of candidate speakers B and C as the target speaker are predicted in the same way. Finally, the candidate speaker with the highest score is selected as the target speaker of the current dialog text.
Taking the first dialog text in example one as an example, the first dialog text, its context, the current candidate speaker (e.g., the candidate speaker "Hanzi"), and the relationship characteristics between "Hanzi" and the first dialog text are input into the prediction model to obtain the score of "Hanzi" as the target speaker. The scores of the candidate speakers "Winter Solstice" and "He Feng" are predicted in the same way, and the candidate speaker with the highest score (e.g., "Hanzi") is selected as the target speaker of the first dialog text.
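A minimal sketch of this per-candidate scoring loop; `feature_fn` and `score_fn` are placeholders standing in for the feature extraction and the trained prediction model, not APIs from the patent:

```python
def pick_speaker_individually(dialog, context, candidates,
                              feature_fn, score_fn):
    """Alternative one: one model call per candidate; keep the best."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        feats = feature_fn(cand, dialog, context)
        score = score_fn(dialog, context, cand, feats)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Usage with toy stand-ins for the model:
speaker, score = pick_speaker_individually(
    "quote", "context", ["Winter Solstice", "Hanzi", "He Feng"],
    feature_fn=lambda c, d, ctx: {},
    score_fn=lambda d, ctx, c, f: {"Hanzi": 0.99}.get(c, 0.1))
print(speaker, score)  # Hanzi 0.99
```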
Optionally, the inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristic into a prediction model to predict the score of each candidate speaker as the target speaker includes:
step B1, combining all candidate speakers of the current dialog text pairwise to obtain candidate speaker combinations;
step B2, obtaining input data corresponding to each candidate speaker combination, wherein the input data corresponding to the current candidate speaker combination comprises: the current dialog text, the context of the current dialog text, the current candidate speaker combination of the current dialog text, and the relationship characteristics between each candidate speaker in the current candidate speaker combination and the current dialog text;
and step B3, sequentially inputting the input data corresponding to each candidate speaker combination into the prediction model, and respectively predicting the score of each candidate speaker in each candidate speaker combination as the target speaker.
In the second alternative, the candidate speakers of the current dialog text are combined pairwise into candidate speaker combinations, and each combination is predicted separately; that is, each invocation of the prediction model takes the input data corresponding to one candidate speaker combination.
Assuming the current dialog text has three candidate speakers A, B, and C, combining them pairwise yields the following candidate speaker combinations: (A, B), (A, C), (B, C).
Inputting the current dialog text, the context of the current dialog text, the current candidate speaker combination (such as the candidate speaker combination (A, B)) of the current dialog text and the relationship characteristics between each candidate speaker in the current candidate speaker combination and the current dialog text (such as the relationship characteristics between the candidate speaker A and the current dialog text and the relationship characteristics between the candidate speaker B and the current dialog text) into a prediction model to obtain the score of the candidate speaker A and the score of the candidate speaker B in the candidate speaker combination (A, B).
The candidate speaker combinations (A, C) and (B, C) are handled in the same way. The scores within the combinations are compared and the higher-scoring candidate is kept. For example, if the score of candidate speaker A in the combination (A, B) is higher, the other combinations containing A are compared next, e.g., the combination (A, C); if the score of candidate speaker C in (A, C) is higher than that of A, the target speaker is determined to be C.
Taking the first dialog text in example one as an example, pairwise combination of the three candidate speakers "Hanzi", "Winter Solstice", and "He Feng" yields the following candidate speaker combinations: (Winter Solstice, Hanzi), (Winter Solstice, He Feng), (Hanzi, He Feng). Assuming the score of "Hanzi" in the combination (Winter Solstice, Hanzi) is higher than that of "Winter Solstice", the other combinations containing "Hanzi" are compared next, i.e., the combination (Hanzi, He Feng); in that combination the score of "Hanzi" is higher than that of "He Feng", so "Hanzi" can be determined as the target speaker of the first dialog text.
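A minimal sketch of this pairwise comparison, carrying the winner of each pair forward as in the example above; `pair_score_fn` is a placeholder for the prediction model and returns one score per candidate in the pair:

```python
def pick_speaker_pairwise(dialog, context, candidates, pair_score_fn):
    """Alternative two: compare candidates two at a time; keep winners."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        score_w, score_c = pair_score_fn(dialog, context, winner, challenger)
        if score_c > score_w:
            winner = challenger
    return winner

# Toy stand-in in which "Hanzi" outscores everyone:
beats = lambda d, ctx, a, b: (1.0 if a == "Hanzi" else 0.2,
                              1.0 if b == "Hanzi" else 0.2)
print(pick_speaker_pairwise("quote", "context",
                            ["Winter Solstice", "Hanzi", "He Feng"], beats))
```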
Optionally, the inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristic into a prediction model to predict the score of each candidate speaker as the target speaker includes:
and inputting the current dialog text, the context of the current dialog text, all candidate speakers of the current dialog text and the relationship characteristics between each candidate speaker and the current dialog text into a prediction model together, and predicting the score of each candidate speaker as the target speaker.
In the third alternative, the prediction model scores the multiple candidate speakers of the current dialog text simultaneously; that is, a single invocation of the prediction model takes the input data corresponding to all candidate speakers.
Taking the first dialog text in example one as an example, the first dialog text, its context, all of its candidate speakers ("Winter Solstice", "Hanzi", "He Feng"), and the relationship characteristics between each candidate speaker and the first dialog text (i.e., the relationship characteristics between "Winter Solstice" and the first dialog text, between "Hanzi" and the first dialog text, and between "He Feng" and the first dialog text) are input into the prediction model together, and the model outputs a score for each of the three candidate speakers.
Assuming the prediction model is a five-way classification model, it outputs the following five scores: [0.1, 0.99, 0.32, 0, 0], where 0.1 corresponds to the candidate speaker "Winter Solstice", 0.99 to "Hanzi", and 0.32 to "He Feng"; "Hanzi" can therefore be determined as the target speaker of the first dialog text.
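A minimal sketch of this joint scoring; `joint_score_fn` is a placeholder for the prediction model, and the fixed five-slot output mirrors the five-way classifier above (unused slots are ignored):

```python
def pick_speaker_jointly(dialog, context, candidates, joint_score_fn):
    """Alternative three: one model call scores every candidate."""
    scores = joint_score_fn(dialog, context, candidates)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

# Toy stand-in reproducing the [0.1, 0.99, 0.32, 0, 0] example:
scorer = lambda d, ctx, cands: [0.1, 0.99, 0.32, 0, 0][:len(cands)]
print(pick_speaker_jointly("quote", "context",
                           ["Winter Solstice", "Hanzi", "He Feng"], scorer))
# ('Hanzi', 0.99)
```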
It should be noted that, in the specific implementation, the model structure, the input data, and the output data of the prediction model may all be different for the above three alternatives.
In an optional embodiment of the present invention, before determining the speaker candidate of the current dialog text according to the context of the current dialog text, the method may further comprise:
step S31, sampling the target text to obtain a sampled text;
step S32, determining sampling references whose occurrence counts in the sampled text meet a preset condition;
the determining the candidate speakers of the current dialog text according to the context of the current dialog text then comprises: determining a reference that appears both among the sampling references and in the context of the current dialog text as a candidate speaker of the current dialog text.
Before determining the candidate speakers of the current dialog text, the embodiment of the invention samples the target text to obtain a sampled text, recognizes all references in the sampled text, and selects the sampling references whose occurrence counts in the sampled text meet a preset condition. A sampling reference is a high-frequency reference that users care more about. For example, all references in the sampled text are sorted by occurrence count from high to low, and the top K references are selected as the sampling references. For instance, the system may automatically recognize 100 references in a target text while the user only cares about the 20 highest-frequency ones, the others likely belonging to unimportant roles; the 20 references with the most occurrences are then selected as the sampling references.
After the sampling references are determined, when determining the candidate speakers of the current dialog text, a reference that appears both among the sampling references and in the context of the current dialog text is determined to be a candidate speaker of the current dialog text. Taking the first dialog text in example one as an example, the target text is sampled and the sampling references are determined; assume 20 sampling references are obtained. When determining the candidate speakers of the first dialog text, "Winter Solstice", "Hanzi", and "He Feng" are determined to be candidate speakers, because these references appear both among the sampling references and in the context of the first dialog text. The other 17 sampling references do not appear in the context of the first dialog text and therefore cannot be candidate speakers of the first dialog text.
In this way, the embodiment of the invention ensures that the determined candidate speakers of the current dialog text are high-frequency references in the target text, which meets users' actual needs and further improves the accuracy of determining candidate speakers.
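A minimal sketch of this top-K sampling filter (the function names and the toy data are illustrative assumptions):

```python
from collections import Counter

def select_sampling_references(references, k=20):
    """Keep the K most frequent references in the sampled text."""
    return {ref for ref, _ in Counter(references).most_common(k)}

def candidate_speakers(context_references, sampling_references):
    """A reference becomes a candidate speaker only if it is both a
    high-frequency sampling reference and present in the context."""
    return [r for r in context_references if r in sampling_references]

sampled = select_sampling_references(
    ["Hanzi"] * 5 + ["Winter Solstice"] * 4 + ["He Feng"] * 3 + ["passerby"],
    k=3)
print(candidate_speakers(["Winter Solstice", "Hanzi", "He Feng", "passerby"],
                         sampled))  # "passerby" is filtered out
```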
In an optional embodiment of the invention, the method may further comprise:
step S41, identifying whether the references in the target text correspond to the same entity;
step S42, performing coreference resolution on the references corresponding to the same entity to obtain all the dialog texts of the same role.
After all references in the target text are recognized, the embodiment of the invention can perform coreference resolution on them. Coreference resolution refers to binding together the references to the same entity, so that all the dialog texts of the same role can be integrated. This facilitates subsequent processing of the dialog texts of different roles, such as recording dialog audio for different roles or synthesizing voice data for different roles.
For example, in example one, the text "Winter Solstice learned that Hanzi was called He Feng" shows that "Hanzi" and "He Feng" correspond to the same entity, i.e., the same role. The embodiment of the invention performs coreference resolution on such references, e.g., binding "Hanzi" and "He Feng" together, for instance binding both to the role "He Feng", so that all the dialog texts of the role "He Feng" can be integrated.
The embodiment of the present invention does not limit the specific manner of identifying whether references correspond to the same entity. For example, a classification model may be trained in advance; its training data may be text data containing labeled references together with a label indicating whether those references are the same entity. After training, the references and the text in which they appear are input into the classification model, which outputs a score indicating whether they refer to the same entity.
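A minimal sketch of binding references once such a classifier is available, using a small union-find; `same_entity_fn` is a placeholder for the trained classification model:

```python
def resolve_coreference(reference_pairs, same_entity_fn, threshold=0.5):
    """Bind references judged to be the same entity, so each role ends
    up with one canonical name."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in reference_pairs:
        if same_entity_fn(a, b) >= threshold:
            parent[find(a)] = find(b)
    return {ref: find(ref) for ref in list(parent)}

# Toy stand-in binding "Hanzi" to "He Feng":
print(resolve_coreference([("Hanzi", "He Feng")], lambda a, b: 0.9))
```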
In an optional embodiment of the invention, the method may further comprise:
step S51, acquiring a target dialog text in the target text and a target speaker of the target dialog text;
step S52, according to the role characteristics of the target speaker of the target dialogue text and the dialogue scene characteristics of the target dialogue text, carrying out voice synthesis on the target dialogue text to obtain voice synthesis data of the target dialogue text.
After the target speakers of the dialog texts in the target text are recognized, the embodiment of the invention can also perform speech synthesis on the target dialog texts by role to obtain their speech synthesis data. Playing the synthesized voice data of the different roles brings listeners a more intuitive listening experience.
The target dialog text may be a dialog text specified in the target text, or all of the dialog texts in the target text. The role features include, but are not limited to, any one or more of the following: the role's personality features, timbre features, gender features, and age features; the dialog scene features include, but are not limited to, any one or more of the following: the conversation's tone features, emotion features, and place features.
In the embodiment of the present invention, the target dialog text may include any one or more of the following texts: dialog text, narration text, monologue text, and inner monologue text. The embodiment of the invention can perform speech synthesis by role on the dialog texts between roles in the target text, and likewise on the narration, monologue, and inner monologue texts of the roles, so that the resulting speech synthesis data is more targeted, richer, and more diverse, bringing listeners a more intuitive and immersive listening experience.
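A minimal sketch of passing both feature sets into a synthesizer; the `RoleFeatures`/`SceneFeatures` schemas and the `tts_engine.synthesize` call are illustrative placeholders, not a real TTS API:

```python
from dataclasses import dataclass

@dataclass
class RoleFeatures:      # role features named above
    personality: str
    timbre: str
    gender: str
    age: str

@dataclass
class SceneFeatures:     # dialog scene features named above
    tone: str
    emotion: str
    place: str

def synthesize_dialog(dialog_text, role: RoleFeatures,
                      scene: SceneFeatures, tts_engine):
    """Hand both feature sets to the synthesis backend."""
    return tts_engine.synthesize(text=dialog_text,
                                 role=vars(role), scene=vars(scene))
```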
In summary, the recognition method provided by the embodiment of the invention can automatically recognize the target speaker corresponding to each dialog text in a target text. Specifically, the dialog texts in the target text are first recognized, and the candidate speakers of the current dialog text are determined according to the context of the current dialog text. Then, the relationship characteristics between the candidate speakers and the current dialog text are acquired, and at least one target speaker of the current dialog text is determined according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics. The relationship characteristics between a candidate speaker and the current dialog text reflect the relevance between the two, and can therefore serve as effective parameters for determining the target speaker of the current dialog text, improving the accuracy of that determination. The embodiment of the invention combines multiple factors, namely the current dialog text, its context, its candidate speakers, and the relationship characteristics between the candidate speakers and the current dialog text, to comprehensively determine at least one target speaker of the current dialog text, thereby improving the accuracy of target speaker recognition while also reducing labor cost and improving recognition efficiency.
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a speech synthesis method according to the present invention is shown, which may specifically include the following steps:
step 201, determining at least one target speaker of each dialogue text in the target text by using the identification method of any one of claims 1 to 11;
step 202, synthesizing voice data of the corresponding dialog text according to at least one target speaker of each dialog text in the target text.
The speech synthesis method provided by the invention is applicable to electronic devices, including but not limited to: servers, smart phones, voice recorders, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-car computers, desktop computers, set-top boxes, smart TVs, wearable devices, and the like.
In a specific implementation, the identification method shown in fig. 1 may be performed in an electronic device. For example, a target text to be synthesized may be input into the electronic device, and the electronic device determines at least one target speaker of each dialog text in the target text by executing the recognition method shown in fig. 1, and then synthesizes voice data corresponding to the dialog text according to the at least one target speaker of each dialog text in the target text.
The speech synthesis method provided by the embodiment of the invention first automatically performs dialog text recognition and speaker matching on a target text, determining at least one target speaker for each dialog text in the target text; speech synthesis is then performed on each dialog text according to the identified at least one target speaker to obtain the synthesized voice data.
For example, 100 dialog texts are recognized in the target text, at least one target speaker can be determined for each dialog text in the 100 dialog texts, and then, for each dialog text, speech synthesis is performed according to the determined corresponding target speaker to obtain speech data of each dialog text.
Further, the recognized texts may include any one or more of the following: dialog text, narration text, monologue text, and inner monologue text. The embodiment of the invention can match the dialog texts in the target text to their different roles, perform speech synthesis by role on the dialog texts between the roles, and likewise perform speech synthesis by role on the narration, monologue, and inner monologue texts of the roles, so that the resulting speech synthesis data is more targeted, richer, and more diverse, bringing listeners a more intuitive and immersive listening experience.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the order of the acts described, as some steps may be performed in other orders or concurrently according to the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the acts involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 3, a block diagram of an embodiment of an identification apparatus of the present invention is shown, which may include:
a dialog recognition module 301, configured to recognize dialog texts in the target text;
a candidate determining module 302, configured to determine candidate speakers of a current dialog text according to the context of the current dialog text;
a feature obtaining module 303, configured to obtain the relationship characteristics between the candidate speakers and the current dialog text;
and a target determining module 304, configured to determine at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics.
Optionally, the candidate determining module includes:
a model recognition submodule, configured to input the context of the current dialog text into a recognition model sentence by sentence and recognize the designations in the context;
and a candidate determining submodule, configured to take each recognized designation as a candidate speaker of the current dialog text.
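A minimal sketch of these two submodules follows, with a stand-in tagger in place of the trained recognition model; the regex and the role-noun list are hypothetical.

```python
import re

ROLE_NOUNS = {"teacher", "mother", "old man"}  # hypothetical designation nouns

def recognition_model(sentence):
    """Stand-in for the trained model: returns designations in one sentence."""
    names = set(re.findall(r"\b[A-Z][a-z]+\b", sentence)) - {"The", "A"}
    roles = {noun for noun in ROLE_NOUNS if noun in sentence.lower()}
    return names | roles

def candidates_from_context(context_sentences):
    candidates = set()
    for sentence in context_sentences:   # sentence-by-sentence input
        candidates |= recognition_model(sentence)
    return sorted(candidates)

print(candidates_from_context(["Alice looked up.", "The teacher frowned at Bob."]))
# ['Alice', 'Bob', 'teacher']
```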
Optionally, the apparatus further comprises:
an entity recognition module, configured to identify whether different designations in the target text correspond to the same entity;
and a coreference resolution module, configured to perform coreference resolution on the designations corresponding to the same entity, so as to obtain all the dialog texts of the same role.
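The grouping performed by these two modules might look like the following sketch, where the alias table is an illustrative stand-in for the output of a real coreference model.

```python
def resolve_roles(dialog_speakers, aliases):
    """dialog_speakers: list of (dialog_text, designation) pairs.
    aliases: designation -> canonical entity name (assumed precomputed)."""
    roles = {}
    for dialog, designation in dialog_speakers:
        entity = aliases.get(designation, designation)
        roles.setdefault(entity, []).append(dialog)
    return roles

aliases = {"Liz": "Elizabeth", "Miss Bennet": "Elizabeth"}
print(resolve_roles(
    [("I will not.", "Liz"), ("You must!", "Darcy"), ("Never.", "Miss Bennet")],
    aliases,
))
# {'Elizabeth': ['I will not.', 'Never.'], 'Darcy': ['You must!']}
```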
Optionally, the apparatus further comprises:
a target acquisition module, configured to acquire a target dialog text in the target text and a target speaker of the target dialog text;
and a speech synthesis module, configured to perform speech synthesis on the target dialog text according to the role characteristics of the target speaker of the target dialog text and the dialog scene characteristics of the target dialog text, so as to obtain speech synthesis data of the target dialog text.
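One plausible shape for this conditioning, sketched under the assumption of a generic keyword-style TTS interface (the patent fixes neither the feature names nor the API), is:

```python
def synthesize_with_style(dialog, role_features, scene_features, tts):
    """Condition synthesis on role characteristics and dialog-scene characteristics."""
    params = {
        "voice": role_features.get("voice", "default"),
        "pitch": role_features.get("pitch", 1.0),
        # Scene features could steer prosody; here they just pick a preset.
        "emotion": scene_features.get("mood", "neutral"),
    }
    return tts(dialog, **params)

# Dummy backend standing in for a real TTS engine:
audio = synthesize_with_style(
    "Get out!", {"voice": "male_old", "pitch": 0.9}, {"mood": "angry"},
    tts=lambda text, **p: f"{p}|{text}".encode(),
)
print(audio)
```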
Optionally, the target determining module includes:
a score prediction sub-module, configured to input the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics into a prediction model, and predict a score of each candidate speaker as a target speaker;
and a target determining submodule, configured to determine at least one target speaker of the current dialog text from the candidate speakers according to the predicted score of each candidate speaker.
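A brief sketch of this selection step follows; the margin rule, which permits more than one target speaker when scores are close, is an assumption made for illustration.

```python
def select_targets(scores, margin=0.05):
    """scores: candidate -> predicted probability of being the speaker."""
    if not scores:
        return []
    best = max(scores.values())
    # Keep the top candidate plus any candidate within the margin.
    return [c for c, s in scores.items() if s >= best - margin]

print(select_targets({"Alice": 0.81, "Bob": 0.78, "narrator": 0.10}))
# ['Alice', 'Bob']  (two target speakers within the margin)
```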
Optionally, the score prediction sub-module comprises:
the first obtaining unit is used for obtaining input data corresponding to each candidate speaker, wherein the input data corresponding to the current candidate speaker comprises: the current dialog text, the context of the current dialog text, the current candidate speaker of the current dialog text, and the relationship characteristics between the current candidate speaker and the current dialog text;
and the first prediction unit is used for sequentially inputting the input data corresponding to each candidate speaker into the prediction model and respectively predicting the score of each candidate speaker as the target speaker.
Optionally, the score prediction sub-module comprises:
a candidate combination unit, configured to combine the candidate speakers of the current dialog text in pairs to obtain candidate speaker combinations;
a second obtaining unit, configured to obtain input data corresponding to each candidate speaker combination, where the input data corresponding to the current candidate speaker combination includes: the current dialog text, the context of the current dialog text, the current candidate speaker combination of the current dialog text, and the relationship characteristics between each candidate speaker in the current candidate speaker combination and the current dialog text;
and the second prediction unit is used for sequentially inputting the input data corresponding to each candidate speaker combination into the prediction model and respectively predicting the score of each candidate speaker in each candidate speaker combination as the target speaker.
Optionally, the score prediction sub-module comprises:
and the third prediction unit is used for inputting the current dialog text, the context of the current dialog text, all candidate speakers of the current dialog text and the relationship characteristics between each candidate speaker and the current dialog text into a prediction model together, and predicting the score of each candidate speaker as the target speaker.
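The three input-construction strategies described by these sub-modules can be contrasted in a short sketch; the tuple layout is an assumed serialization, since the patent does not fix one.

```python
from itertools import combinations

def per_candidate_inputs(dialog, context, candidates, features):
    """One prediction-model input per candidate."""
    return [(dialog, context, c, features[c]) for c in candidates]

def pairwise_inputs(dialog, context, candidates, features):
    """One input per pair of candidates."""
    return [(dialog, context, (a, b), (features[a], features[b]))
            for a, b in combinations(candidates, 2)]

def joint_input(dialog, context, candidates, features):
    """A single input carrying all candidates at once."""
    return (dialog, context, tuple(candidates),
            tuple(features[c] for c in candidates))

feats = {"Alice": {"distance": 1}, "Bob": {"distance": 3}}
print(len(per_candidate_inputs("Hi.", "...", ["Alice", "Bob"], feats)))  # 2
print(len(pairwise_inputs("Hi.", "...", ["Alice", "Bob"], feats)))       # 1
print(joint_input("Hi.", "...", ["Alice", "Bob"], feats)[2])             # ('Alice', 'Bob')
```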
Optionally, the apparatus further comprises:
a text sampling module, configured to sample the target text to obtain a sampled text;
a sampling selection module, configured to determine the sampled designations whose number of occurrences in the sampled text meets a preset condition.
The candidate determining module is then specifically configured to determine a designation that appears both among the sampled designations and in the context of the current dialog text as a candidate speaker of the current dialog text.
Optionally, the relationship characteristics between the candidate speaker and the current dialog text include any one or more of the following: the distance between the candidate speaker and the current dialog text, whether the candidate speaker and the current dialog text are located in the same span, and the number of times the candidate speaker appears in the context of the current dialog text.
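These three features might be computed as in the sketch below, where "same span" is read as "same paragraph", one plausible interpretation the patent leaves open; the sentence and paragraph indices are toy inputs.

```python
def relationship_features(candidate_pos, dialog_pos, candidate_para,
                          dialog_para, context, candidate):
    """Toy computation of the three relationship characteristics listed above."""
    return {
        "distance": abs(dialog_pos - candidate_pos),   # e.g. in sentences
        "same_span": candidate_para == dialog_para,    # assumed: same paragraph
        "occurrences": context.count(candidate),       # count in the context
    }

print(relationship_features(
    candidate_pos=2, dialog_pos=4, candidate_para=1, dialog_para=1,
    context="Alice smiled. Alice spoke.", candidate="Alice"))
# {'distance': 2, 'same_span': True, 'occurrences': 2}
```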
Optionally, the dialog recognition module is specifically configured to recognize any one or more of the following texts in the target text: dialogue text, bystander text, monologue text, and inner monologue text.
The recognition device provided by the embodiment of the present invention can automatically recognize the target speaker corresponding to each dialog text in a target text. Specifically, the dialog texts in the target text are recognized by the dialog recognition module, and candidate speakers of the current dialog text are determined by the candidate determining module according to the context of the current dialog text. Then, the relationship characteristics between the candidate speakers and the current dialog text are acquired by the feature obtaining module, and at least one target speaker of the current dialog text is determined by the target determining module according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics. The relationship characteristics between a candidate speaker and the current dialog text reflect the relevance between the two, and can therefore serve as effective parameters for determining the target speaker of the current dialog text, improving the accuracy of that determination. By comprehensively combining multiple factors, namely the current dialog text, its context, the candidate speakers, and the relationship characteristics between the candidate speakers and the current dialog text, the embodiment of the present invention determines at least one target speaker of the current dialog text, which improves the accuracy of target speaker identification while also reducing labor cost and improving recognition efficiency.
Referring to fig. 4, a block diagram of a speech synthesis apparatus of an embodiment of the present invention is shown, which may include:
a matching module 401 for determining at least one target speaker of each dialog text in the target text by using the recognition method of any one of claims 1 to 11;
a synthesizing module 402, configured to synthesize voice data of a corresponding dialog text according to at least one target speaker of each dialog text in the target text.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for identification, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: recognizing dialog text in the target text; determining candidate speakers of the current dialog text according to the context of the current dialog text; acquiring the relation characteristics between the candidate speaker and the current dialog text; and determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relationship characteristics.
Fig. 5 is a block diagram illustrating an apparatus 800 for identification, according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in the position of the apparatus 800 or of a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in the temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the identification method shown in fig. 1.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform an identification method, the method comprising: recognizing dialog text in the target text; determining candidate speakers of the current dialog text according to the context of the current dialog text; acquiring the relation characteristics between the candidate speaker and the current dialog text; and determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relationship characteristics.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The recognition method, the recognition device and the speech synthesis method provided by the invention are described in detail, and specific examples are applied in the text to explain the principle and the implementation of the invention, and the description of the above embodiments is only used to help understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. An identification method, characterized in that the method comprises:
recognizing dialog text in the target text;
determining candidate speakers of the current dialog text according to the context of the current dialog text;
acquiring the relation characteristics between the candidate speaker and the current dialog text;
and determining at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speaker of the current dialog text and the relationship characteristics.
2. The method of claim 1, wherein determining candidate speakers for the current dialog text based on the context of the current dialog text comprises:
inputting the context of the current dialog text into a recognition model sentence by sentence, and recognizing the designations in the context;
and taking each recognized designation as a candidate speaker of the current dialog text.
3. The method of claim 1, further comprising:
identifying whether different designations in the target text correspond to the same entity;
and performing coreference resolution on the designations corresponding to the same entity, so as to obtain all the dialog texts of the same role.
4. The method of claim 1, further comprising:
acquiring a target dialog text in the target text and a target speaker of the target dialog text;
and carrying out voice synthesis on the target dialog text according to the role characteristics of the target speaker of the target dialog text and the dialog scene characteristics of the target dialog text to obtain voice synthesis data of the target dialog text.
5. The method of claim 1, wherein determining at least one target speaker for the current dialog text based on the current dialog text, the context of the current dialog text, the candidate speakers for the current dialog text, and the relationship features comprises:
inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text and the relationship characteristics into a prediction model to predict the score of each candidate speaker as a target speaker;
at least one target speaker of the current dialog text is determined from the candidate speakers according to the predicted score of each candidate speaker.
6. The method of claim 5, wherein said inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship feature into a prediction model to predict the score of each candidate speaker as the target speaker comprises:
acquiring input data corresponding to each candidate speaker, wherein the input data corresponding to the current candidate speaker comprises: the current dialog text, the context of the current dialog text, the current candidate speaker of the current dialog text, and the relationship characteristics between the current candidate speaker and the current dialog text;
and sequentially inputting the input data corresponding to each candidate speaker into the prediction model, and respectively predicting the score of each candidate speaker as the target speaker.
7. The method of claim 5, wherein said inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship feature into a prediction model to predict the score of each candidate speaker as the target speaker comprises:
combining the candidate speakers of the current dialog text in pairs to obtain candidate speaker combinations;
acquiring input data corresponding to each candidate speaker combination, wherein the input data corresponding to the current candidate speaker combination comprises: the current dialog text, the context of the current dialog text, the current candidate speaker combination of the current dialog text, and the relationship characteristics between each candidate speaker in the current candidate speaker combination and the current dialog text;
and sequentially inputting the input data corresponding to each candidate speaker combination into a prediction model, and respectively predicting the score of each candidate speaker in each candidate speaker combination as the target speaker.
8. The method of claim 5, wherein said inputting the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship feature into a prediction model to predict the score of each candidate speaker as the target speaker comprises:
and inputting the current dialog text, the context of the current dialog text, all candidate speakers of the current dialog text and the relationship characteristics between each candidate speaker and the current dialog text into a prediction model together, and predicting the score of each candidate speaker as the target speaker.
9. The method of claim 1, wherein before the determining candidate speakers of the current dialog text according to the context of the current dialog text, the method further comprises:
sampling the target text to obtain a sampled text;
determining the sampled designations whose number of occurrences in the sampled text meets a preset condition;
and the determining candidate speakers of the current dialog text according to the context of the current dialog text comprises:
determining a designation that appears both among the sampled designations and in the context of the current dialog text as a candidate speaker of the current dialog text.
10. The method of any of claims 1 to 9, wherein the relationship characteristics between the candidate speaker and the current dialog text comprise any one or more of the following: the distance between the candidate speaker and the current dialog text, whether the candidate speaker and the current dialog text are located in the same span, and the number of times the candidate speaker appears in the context of the current dialog text.
11. The method of any one of claims 1 to 9, wherein the identifying dialog text in the target text comprises:
identifying any one or more of the following in the target text: dialogue text, bystander text, monologue text, and inner monologue text.
12. A method of speech synthesis, the method comprising:
determining at least one target speaker of each dialog text in the target text by using the recognition method of any one of claims 1 to 11;
and synthesizing voice data of the corresponding dialog text according to at least one target speaker of each dialog text in the target text.
13. An identification device, the device comprising:
a dialog recognition module, configured to recognize dialog texts in a target text;
a candidate determining module, configured to determine candidate speakers of a current dialog text according to the context of the current dialog text;
a feature acquisition module, configured to acquire the relationship characteristics between the candidate speakers and the current dialog text;
and a target determining module, configured to determine at least one target speaker of the current dialog text according to the current dialog text, the context of the current dialog text, the candidate speakers of the current dialog text, and the relationship characteristics.
14. An apparatus for identification, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the identification method of any one of claims 1 to 11.
15. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the identification method of any of claims 1 to 11.
CN202110605363.8A 2021-05-31 2021-05-31 Recognition method, device for recognition and voice synthesis method Pending CN113409766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110605363.8A CN113409766A (en) 2021-05-31 2021-05-31 Recognition method, device for recognition and voice synthesis method


Publications (1)

Publication Number Publication Date
CN113409766A 2021-09-17

Family ID: 77675509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110605363.8A Pending CN113409766A (en) 2021-05-31 2021-05-31 Recognition method, device for recognition and voice synthesis method

Country Status (1)

Country Link
CN (1) CN113409766A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241409A1 (en) * 2022-06-17 2023-12-21 北京有竹居网络技术有限公司 Method and apparatus for determining speaker in text, device, and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150061437A (en) * 2013-11-27 2015-06-04 한국과학기술원 Method and system of mapping voice to speaking character for robot storytelling
CN108091321A (en) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 A kind of phoneme synthesizing method
CN109036372A (en) * 2018-08-24 2018-12-18 科大讯飞股份有限公司 A kind of voice broadcast method, apparatus and system
CN109473090A (en) * 2018-09-30 2019-03-15 北京光年无限科技有限公司 A kind of narration data processing method and processing device towards intelligent robot
CN109658916A (en) * 2018-12-19 2019-04-19 腾讯科技(深圳)有限公司 Phoneme synthesizing method, device, storage medium and computer equipment
CN110364139A (en) * 2019-06-27 2019-10-22 上海麦克风文化传媒有限公司 A kind of matched text-to-speech working method of progress Autonomous role
CN110491365A (en) * 2018-05-10 2019-11-22 微软技术许可有限责任公司 Audio is generated for plain text document
CN110634336A (en) * 2019-08-22 2019-12-31 北京达佳互联信息技术有限公司 Method and device for generating audio electronic book
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN111859971A (en) * 2020-07-23 2020-10-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for processing information
CN112233648A (en) * 2019-12-09 2021-01-15 北京来也网络科技有限公司 Data processing method, device, equipment and storage medium combining RPA and AI
CN112837672A (en) * 2019-11-01 2021-05-25 北京字节跳动网络技术有限公司 Method and device for determining conversation affiliation, electronic equipment and storage medium



Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN110210310B (en) Video processing method and device for video processing
CN110634483A (en) Man-machine interaction method and device, electronic equipment and storage medium
CN111145756B (en) Voice recognition method and device for voice recognition
CN107291704B (en) Processing method and device for processing
CN111128183B (en) Speech recognition method, apparatus and medium
CN107274903B (en) Text processing method and device for text processing
CN108304412B (en) Cross-language search method and device for cross-language search
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN111241822A (en) Emotion discovery and dispersion method and device under input scene
CN111696538B (en) Voice processing method, device and medium
CN108628819B (en) Processing method and device for processing
CN110990534A (en) Data processing method and device and data processing device
CN111160047A (en) Data processing method and device and data processing device
CN111984749A (en) Method and device for ordering interest points
CN112291614A (en) Video generation method and device
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN114154459A (en) Speech recognition text processing method and device, electronic equipment and storage medium
CN111797262A (en) Poetry generation method and device, electronic equipment and storage medium
CN111640452A (en) Data processing method and device and data processing device
CN111369978A (en) Data processing method and device and data processing device
CN113343675A (en) Subtitle generating method and device for generating subtitles
CN112036174A (en) Punctuation marking method and device
CN113657101A (en) Data processing method and device and data processing device
CN108241614B (en) Information processing method and device, and device for information processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination