CN115312066A - Method and apparatus for recognizing speaker for text and training speaker recognition model - Google Patents

Method and apparatus for recognizing speaker for text and training speaker recognition model

Info

Publication number
CN115312066A
Authority
CN
China
Prior art keywords
speaker
sentence
sentences
target
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210865694.XA
Other languages
Chinese (zh)
Inventor
申柯秋
徐东
赵伟峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210865694.XA priority Critical patent/CN115312066A/en
Publication of CN115312066A publication Critical patent/CN115312066A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for recognizing the speaker of a dialogue sentence in a text and for training a speaker recognition model, and belongs to the field of computer technologies. The method comprises the following steps: acquiring a target dialogue sentence in a target text; performing speaker name recognition on adjacent sentences of the target dialogue sentence based on a speaker list corresponding to the target text; determining the speaker-dependent sentence of the target dialogue sentence among the adjacent sentences based on the speaker name recognition results of the adjacent sentences; and if the speaker-dependent sentence of the target dialogue sentence is determined, inputting the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain the speaker information of the target dialogue sentence. By adopting the method and the device, the recognition accuracy of the speaker recognition model can be improved.

Description

Method and apparatus for recognizing speaker for text and training speaker recognition model
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for recognizing the speaker of dialogue in a text and for training a speaker recognition model.
Background
With the continuous development of speech synthesis technology, audiobooks have become widely popular. When an audiobook is read aloud, a narration mode in which multiple speakers are dubbed can greatly improve the user experience. Implementing such a multi-speaker narration mode requires determining the speaker of each dialogue in the text. Currently, the prevailing approach is to input the complete text of an article into a machine learning model to obtain the speaker corresponding to each sentence.
However, for the prediction results of such a model to be accurate, a large number of complete sample articles are needed to train it. When the number of sample articles is small, the accuracy of the trained model's predictions cannot be guaranteed.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for recognizing the speaker of text and for training a speaker recognition model, so as to solve the problems in the related art. The technical solution is as follows:
In a first aspect, a method for recognizing a speaker for text is provided, the method comprising:
acquiring a target dialogue sentence in a target text, wherein the target dialogue sentence is any dialogue sentence of the target text;
performing speaker name recognition on adjacent sentences of the target dialogue sentence based on a speaker list corresponding to the target text, wherein the speaker list comprises at least one speaker name;
determining the speaker-dependent sentence of the target dialogue sentence among the adjacent sentences based on the speaker name recognition results of the adjacent sentences;
and if the speaker-dependent sentence of the target dialogue sentence is determined, inputting the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain speaker information of the target dialogue sentence, wherein the speaker information is the speaker name of the target dialogue sentence or indication information indicating that the target dialogue sentence has no corresponding speaker name.
In a possible implementation manner, before the acquiring of the target dialogue sentence in the target text, the method further includes:
splitting the target text into a plurality of sentences based on end-of-sentence punctuation marks;
for sentences containing quotation marks, dividing the content inside the quotation marks and the content outside the quotation marks into different sentences;
and determining, among all sentences obtained by splitting the target text, the content inside quotation marks as the dialogue sentences in the target text.
In a possible implementation manner, the acquiring of the target dialogue sentence in the target text includes:
acquiring the target dialogue sentence from among all dialogue sentences of the target text, in front-to-back order of their positions in the target text.
In a possible implementation manner, the performing of speaker name recognition on the adjacent sentences of the target dialogue sentence based on the speaker list corresponding to the target text includes:
searching the adjacent sentences of the target dialogue sentence for speaker names from the speaker list corresponding to the target text, and determining the adjacent sentences that contain a speaker name.
In one possible implementation, the determining, among the adjacent sentences, of the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences includes:
if only one adjacent sentence of the target dialogue sentence contains a speaker name, and that adjacent sentence has not been determined as the speaker-dependent sentence of a dialogue sentence other than the target dialogue sentence, determining the adjacent sentence containing the speaker name as the speaker-dependent sentence of the target dialogue sentence;
if both adjacent sentences of the target dialogue sentence contain speaker names, and the adjacent sentence preceding the target dialogue sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, determining the preceding adjacent sentence as the speaker-dependent sentence of the target dialogue sentence;
if both adjacent sentences of the target dialogue sentence contain speaker names, and the adjacent sentence preceding the target dialogue sentence has been determined as the speaker-dependent sentence of another dialogue sentence, determining the adjacent sentence following the target dialogue sentence as the speaker-dependent sentence of the target dialogue sentence;
and if the target dialogue sentence has no adjacent sentence containing a speaker name, or its adjacent sentence containing a speaker name has been determined as the speaker-dependent sentence of another dialogue sentence, determining that the target dialogue sentence has no speaker-dependent sentence.
In one possible implementation, after the determining, among the adjacent sentences, of the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences, the method further includes:
if no speaker-dependent sentence of the target dialogue sentence is determined, determining that the target dialogue sentence has no corresponding speaker name.
In one possible implementation, the speaker recognition model is a machine reading comprehension (MRC) model;
the inputting of the target dialogue sentence and the speaker-dependent sentence into the trained speaker recognition model to obtain the speaker information of the target dialogue sentence includes:
composing a question field from the target dialogue sentence and a preset question, and composing a question-related text field from the target dialogue sentence and the speaker-dependent sentence, wherein the question is used for asking the speaker name of the target dialogue sentence;
and inputting the question field and the question-related text field into the MRC model to obtain the speaker information of the target dialogue sentence.
In a second aspect, a method of training a speaker recognition model is provided, the method comprising:
acquiring a sample dialogue sentence in a sample text;
performing speaker name recognition on adjacent sentences of the sample dialogue sentence based on a sample speaker list corresponding to the sample text, wherein the sample speaker list comprises at least one speaker name;
determining, among the adjacent sentences of the sample dialogue sentence, the sample speaker-dependent sentence corresponding to the sample dialogue sentence based on the speaker name recognition results of the adjacent sentences;
training and adjusting the parameters of a speaker recognition model to be trained, using the sample dialogue sentence and its corresponding sample speaker-dependent sentence as input sample data, to obtain a parameter-adjusted speaker recognition model;
and if the training and parameter adjustment meet a preset end condition, determining the parameter-adjusted speaker recognition model as the trained speaker recognition model.
In a third aspect, an apparatus for recognizing a speaker for text is provided, the apparatus comprising:
an acquisition module, configured to acquire a target dialogue sentence in a target text;
a recognition module, configured to perform speaker name recognition on adjacent sentences of the target dialogue sentence based on a speaker list corresponding to the target text, wherein the speaker list comprises at least one speaker name;
a determining module, configured to determine, among the adjacent sentences, the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences;
and an output module, configured to, if the speaker-dependent sentence of the target dialogue sentence is determined, input the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain speaker information of the target dialogue sentence, wherein the speaker information is the speaker name of the target dialogue sentence or indication information indicating that the target dialogue sentence has no corresponding speaker name.
In a possible implementation manner, the determining module is further configured to:
split the target text into a plurality of sentences based on end-of-sentence punctuation marks;
for sentences containing quotation marks, divide the content inside the quotation marks and the content outside the quotation marks into different sentences;
and determine, among all sentences obtained by splitting the target text, the content inside quotation marks as the dialogue sentences in the target text.
In a possible implementation manner, the obtaining module is configured to:
acquire the target dialogue sentence from among all dialogue sentences of the target text, in front-to-back order of their positions in the target text.
In one possible implementation manner, the identification module is configured to:
search the adjacent sentences of the target dialogue sentence for speaker names from the speaker list corresponding to the target text, and determine the adjacent sentences that contain a speaker name.
In one possible implementation manner, the determining module is configured to:
determine, if only one adjacent sentence of the target dialogue sentence contains a speaker name and that adjacent sentence has not been determined as the speaker-dependent sentence of a dialogue sentence other than the target dialogue sentence, the adjacent sentence containing the speaker name as the speaker-dependent sentence of the target dialogue sentence;
determine, if both adjacent sentences of the target dialogue sentence contain speaker names and the adjacent sentence preceding the target dialogue sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, the preceding adjacent sentence as the speaker-dependent sentence of the target dialogue sentence;
determine, if both adjacent sentences of the target dialogue sentence contain speaker names and the adjacent sentence preceding the target dialogue sentence has been determined as the speaker-dependent sentence of another dialogue sentence, the adjacent sentence following the target dialogue sentence as the speaker-dependent sentence of the target dialogue sentence;
and determine that the target dialogue sentence has no speaker-dependent sentence if the target dialogue sentence has no adjacent sentence containing a speaker name, or if its adjacent sentence containing a speaker name has been determined as the speaker-dependent sentence of another dialogue sentence.
In a possible implementation manner, the determining module is further configured to:
if no speaker-dependent sentence of the target dialogue sentence is determined, determine that the target dialogue sentence has no corresponding speaker name.
In one possible implementation, the speaker recognition model is a machine reading comprehension (MRC) model;
the output module is configured to: compose a question field from the target dialogue sentence and a preset question, and compose a question-related text field from the target dialogue sentence and the speaker-dependent sentence, wherein the question is used for asking the speaker name of the target dialogue sentence;
and input the question field and the question-related text field into the MRC model to obtain the speaker information of the target dialogue sentence.
In a fourth aspect, an apparatus for training a speaker recognition model is provided, the apparatus comprising:
an acquisition module, configured to acquire a sample dialogue sentence in a sample text;
a recognition module, configured to perform speaker name recognition on adjacent sentences of the sample dialogue sentence based on a sample speaker list corresponding to the sample text, wherein the sample speaker list comprises at least one speaker name;
a determining module, configured to determine, among the adjacent sentences of the sample dialogue sentence, the sample speaker-dependent sentence corresponding to the sample dialogue sentence based on the speaker name recognition results of the adjacent sentences;
a training module, configured to train and adjust the parameters of a speaker recognition model to be trained, using the sample dialogue sentence and its corresponding sample speaker-dependent sentence as input sample data, to obtain a parameter-adjusted speaker recognition model;
and an end module, configured to determine the parameter-adjusted speaker recognition model as the trained speaker recognition model if the training and parameter adjustment meet a preset end condition.
In a fifth aspect, a computer device is provided, comprising a processor and a memory, wherein at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to implement the methods of the first aspect and its possible implementations and of the second aspect and its possible implementations.
In a sixth aspect, there is provided a computer readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to implement the method of the first aspect and its possible implementations and the second aspect and its possible implementations.
In a seventh aspect, a computer program product is provided, which includes at least one instruction that is loaded and executed by a processor to implement the method of the first aspect and its possible implementations and the second aspect and its possible implementations.
The technical solutions provided in the embodiments of the present application bring at least the following beneficial effects:
In the embodiments of the present application, a dialogue sentence and its corresponding speaker-dependent sentence are obtained from the target text and input into the speaker recognition model to obtain the corresponding speaker information. In this way, the speaker recognition model performs recognition in units of sentences, and the samples required to train it are likewise in units of sentences. Usually, a single article contains many dialogue sentences and speaker-dependent sentences, so only a small number of articles is needed to obtain a large number of dialogue sentences and speaker-dependent sentences. Therefore, even when the number of sample articles is small, the method can improve the recognition accuracy of the speaker recognition model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flowchart of a method for recognizing a speaker for text according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a process of speaker name recognition on adjacent sentences of a target dialogue sentence in a target text according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a process of determining the speaker-dependent sentence of a target dialogue sentence among adjacent sentences according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process of inputting a target dialogue sentence and a speaker-dependent sentence into a trained speaker recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for training a speaker recognition model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for recognizing a speaker for text according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for training a speaker recognition model according to an embodiment of the present application;
FIG. 8 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Embodiments of the present application provide a method for recognizing a speaker for a text, which may be implemented by a computer device. The computer device may be a server or a terminal or the like. The server may be a single server or a server cluster composed of a plurality of servers. The terminal may be a desktop computer, a notebook computer, a mobile phone, a tablet computer, etc.
The computer device may comprise a processor, a memory, a communication component, etc., to which the processor is connected, respectively.
The processor may be a Central Processing Unit (CPU). The processor may be configured to read text content and process data, for example, splitting the target text into sentences, determining the dialogue sentences, determining the speaker-dependent sentences, determining the speaker information corresponding to a dialogue sentence based on the dialogue sentence and its speaker-dependent sentence, training the speaker recognition model to be trained, and so on.
The memory may include a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic disk, an optical data storage device, and the like. The memory may be used for data storage, for example: storing the pre-stored data required for splitting the target text into sentences, storing the intermediate data generated while splitting the target text, storing the obtained dialogue sentences and their corresponding speaker-dependent sentences, storing the intermediate data generated while determining the dialogue sentences and their corresponding speaker-dependent sentences, storing the pre-stored data required for training the speaker recognition model to be trained, and so on.
The communication means may be a wired network connector, a WiFi (Wireless Fidelity) module, a bluetooth module, a cellular network communication module, etc. The communication means may be used for receiving and transmitting signals.
Fig. 1 is a flowchart of a method for recognizing a speaker for text according to an embodiment of the present application. Referring to fig. 1, the embodiment includes:
101. and performing sentence division processing on the target text, and determining all the white sentences in the target text in the plurality of sentences obtained after the sentence division processing.
102. And acquiring a target pair white sentence in the target text.
103. And carrying out speaker name recognition on adjacent sentences of the target pair of white sentences based on the speaker list corresponding to the target text.
104. Speaker-dependent sentences of the target pair of white sentences are determined among the adjacent sentences based on the speaker name recognition results of the adjacent sentences.
105. And inputting the target sentence pair and the speaker-related sentence into the trained speaker recognition model to obtain the speaker information of the target sentence pair.
The following describes the specific process of speaker recognition for text in detail:
in step 101, the computer device performs sentence segmentation on the target text, and determines a dialog white sentence in the target text among a plurality of sentences obtained after the sentence segmentation. The target text may be text containing a Chinese sentence, such as a novel text, a narrative text, a script text, and the like. The sentence splitting processing means that the target text is split into a plurality of sentences based on punctuations in the target text.
In implementation, the initial text may be preprocessed to obtain the target text. The initial text may be the original, unprocessed article text, which may contain some low-level errors that preprocessing needs to remove. Preprocessing may remove redundant punctuation marks, incorrect punctuation marks, and other symbols that should not appear in the original text. Redundant punctuation marks may be several identical punctuation marks (for example, commas or enumeration commas) used consecutively; incorrect punctuation marks may be several commas used together in place of an ellipsis, or a comma used at the end of a paragraph; symbols that should not appear may be spaces and the like. A technician may first collect these common low-level errors together with a modification rule for each, and pre-store the errors and their corresponding modification rules in the computer device. The initial text is then traversed, and whenever an erroneous item is encountered, it is modified according to the corresponding rule. For redundant punctuation marks, the redundant marks can be deleted; for example, if two consecutive periods appear, one of them is deleted. For incorrect punctuation marks, the wrong mark can be replaced with the correct one; for example, several consecutive commas are replaced with an ellipsis. Symbols that should not appear can be deleted directly; for example, spaces in the text are deleted. As a result, the target text contains no low-level errors, so the computer device is not affected by them when processing the text, which improves the accuracy of the subsequent processing.
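For illustration, a minimal Python sketch of such preprocessing follows. The specific patterns handled (repeated end punctuation, comma runs standing in for an ellipsis, stray spaces in Chinese text) and the function name are assumptions chosen for this example, not a rule set prescribed by the embodiment.

```python
import re

def preprocess(initial_text: str) -> str:
    """Remove common low-level errors from the initial text.

    Illustrative rules only: collapse runs of identical end punctuation,
    turn comma runs into an ellipsis, and delete stray spaces (assumed
    not to belong in the body of a Chinese text).
    """
    text = initial_text
    # Collapse runs of identical sentence-end punctuation, e.g. "。。" -> "。".
    text = re.sub(r"([。！？.!?])\1+", r"\1", text)
    # Replace three or more consecutive commas with an ellipsis.
    text = re.sub(r"[,，]{3,}", "……", text)
    # Delete spaces and tabs, which should not appear in the original text.
    text = re.sub(r"[ \t]+", "", text)
    return text
```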
After the target text is obtained, sentence splitting can be performed on it, and then all dialogue sentences in the target text can be determined. The specific process is as follows:
First, the target text is split into a plurality of sentences based on the positions of the end-of-sentence punctuation marks in the target text. The end-of-sentence punctuation marks may be periods, exclamation marks, question marks, ellipses, and the like.
In implementation, a technician may first collect all end-of-sentence punctuation mark types and pre-store them in the computer device. The target text is then traversed from front to back. When an end-of-sentence punctuation mark is reached, it can be numbered. If the current end-of-sentence punctuation mark is the first one in the target text, the mark together with the text preceding it is determined as one sentence. If it is not the first one, the mark together with the text between it and the previously traversed end-of-sentence punctuation mark is determined as one sentence. In this way, after this preliminary sentence splitting of the target text, a plurality of sentences is obtained; their number may be N, where N is a positive integer.
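A minimal Python sketch of this preliminary splitting step is given below; the set of end-of-sentence punctuation marks is an assumption for illustration.

```python
END_PUNCTUATION = "。！？….!?"  # assumed set of end-of-sentence marks

def split_by_end_punctuation(text: str) -> list[str]:
    """Split text into sentences, each ending at an end-of-sentence mark."""
    sentences: list[str] = []
    start, i = 0, 0
    while i < len(text):
        if text[i] in END_PUNCTUATION:
            # Absorb consecutive end marks such as "……" or "?!".
            while i + 1 < len(text) and text[i + 1] in END_PUNCTUATION:
                i += 1
            # The mark(s) plus the text since the previous mark form a sentence.
            sentences.append(text[start:i + 1])
            start = i + 1
        i += 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text without an end mark
    return sentences
```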
Second, for the sentences among these that contain quotation marks, the content inside the quotation marks and the content outside the quotation marks are divided into different sentences. The quotation marks may be double quotation marks, consisting of an opening quotation mark and a closing quotation mark.
In implementation, after the preliminary sentence splitting of the target text, N sentences are obtained, among which there may be T sentences containing quotation marks (T is a positive integer smaller than N). Further splitting may be performed on these T sentences. In each sentence containing double quotation marks, the opening quotation mark, the closing quotation mark, and the text between them are divided into a separate sentence based on the positions of the quotation marks. If text exists outside the double quotation marks, the text before the opening quotation mark and the text after the closing quotation mark are each divided into separate sentences. In this way, after this further splitting of the target text, all sentences of the target text are obtained; their number may be M, where M is a positive integer greater than N.
Third, among all sentences obtained by splitting the target text, the sentences corresponding to the content inside double quotation marks are determined as the dialogue sentences in the target text.
In implementation, the computer device may record the sentence identifiers of the sentences corresponding to the content inside the double quotation marks (the sentence identifiers may be sequence numbers, position information of the double quotation marks in the text, and the like) and use them as the sentence identifiers of the dialogue sentences. Optionally, after the computer device determines all dialogue sentences in the target text, each dialogue sentence may be numbered according to its order in the target text, to facilitate subsequent use.
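The second and third steps, splitting quoted content out and marking it as dialogue, can be sketched as follows; the Chinese double quotation marks and the return format are assumptions for this example.

```python
def split_quotes(sentences: list[str]) -> tuple[list[str], set[int]]:
    """Split quoted spans out of each sentence and record which of the
    resulting sentences are dialogue sentences (content inside quotes).

    Assumes well-formed Chinese double quotation marks.  Returns the full
    sentence list and the set of indices of the dialogue sentences.
    """
    result: list[str] = []
    dialogue_ids: set[int] = set()
    for sent in sentences:
        pos = 0
        while True:
            open_q = sent.find("“", pos)
            close_q = sent.find("”", open_q + 1) if open_q != -1 else -1
            if open_q == -1 or close_q == -1:
                break
            if open_q > pos:                       # text before the quote
                result.append(sent[pos:open_q])
            dialogue_ids.add(len(result))          # quoted span is dialogue
            result.append(sent[open_q:close_q + 1])
            pos = close_q + 1
        if pos < len(sent):                        # text after the last quote
            result.append(sent[pos:])
    return result, dialogue_ids
```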
In step 102, a target dialogue sentence in the target text is acquired.
The target dialogue sentence may be any one of the dialogue sentences; it is the dialogue sentence for which speaker name recognition is currently required. In implementation, after all dialogue sentences of the target text are determined, they can be acquired one by one, in front-to-back order of their positions in the target text, for the speaker name recognition processing of the subsequent steps.
In step 103, speaker name recognition is performed on the adjacent sentences of the target dialogue sentence in the target text based on the speaker list corresponding to the target text.
The speaker list comprises at least one speaker name, namely the names of all speakers in the target text. The speaker list may be a list of the names of all characters in a novel text, or a list of the names of all roles in a script text. Speaker name recognition refers to finding a speaker name in a sentence.
In implementation, the speaker list may be pre-stored in the computer device. The computer device may first determine the position of the target dialogue sentence; accordingly, among the adjacent sentences of the target dialogue sentence, the adjacent sentence preceding it may be determined as the first adjacent sentence, and the adjacent sentence following it as the second adjacent sentence. Word segmentation is then performed on the content of the first and second adjacent sentences. The computer device then compares each word of the first adjacent sentence with the speaker names in the speaker list to determine whether the first adjacent sentence contains a speaker name, and likewise compares each word of the second adjacent sentence with the speaker names to determine whether the second adjacent sentence contains a speaker name. Fig. 2 gives a specific example in which, after the target dialogue sentence is obtained, speaker name recognition is performed on its adjacent sentences.
Optionally, if the computer device finds a person name from the speaker list in the first or second adjacent sentence of the target dialogue sentence, the person name in that adjacent sentence may be replaced with a special character. The special character is a character carrying an index, for example PERSON i, where i is a positive integer.
In implementation, the preset character PERSON may be stored in advance in the vocabulary of the computer device, for example in the BERT vocabulary. Before step 103 is performed, the computer device may read the speaker list and number each distinct speaker name in order; for example, the speaker names Xiaoming, Xiaohong, …, Xiaogang are numbered 1, 2, …, i in order, where i is a positive integer. The preset character PERSON is then combined with the number corresponding to each speaker name to form the special character PERSON i. Based on the one-to-one correspondence between the distinct speaker names and the special characters PERSON i, a speaker name to special character correspondence table is constructed. Since a special character uniquely identifies a speaker, it may also be called the speaker's unique identifier.
TABLE 1
Name        Special character
Xiaoming    PERSON 1
Xiaohong    PERSON 2
Xiaogang    PERSON 3
……          ……
After assigning special characters to the speaker names in the speaker list, the computer device performs speaker name recognition on the adjacent sentences of the target dialogue sentence. If a speaker name is recognized in an adjacent sentence, the special character corresponding to that speaker name can be looked up in the correspondence table, and the speaker name in the adjacent sentence is then replaced with the special character found. For example, if the speaker name "Xiaoming" is recognized in an adjacent sentence of the target dialogue sentence, and the special character corresponding to "Xiaoming" is found in the correspondence table to be PERSON 1, then "Xiaoming" is replaced with the special character PERSON 1.
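A minimal Python sketch of the name recognition and replacement described above; the substring match stands in for the word-segmentation-based comparison, which is a simplifying assumption.

```python
def build_speaker_map(speaker_list: list[str]) -> dict[str, str]:
    """Number each distinct speaker name in order and map it to its
    special character PERSON i (the correspondence of Table 1)."""
    return {name: f"PERSON {i}" for i, name in enumerate(speaker_list, start=1)}

def recognize_and_replace(sentence: str, speaker_map: dict[str, str]) -> tuple[str, bool]:
    """Return the sentence with any speaker name replaced by its special
    character, plus a flag indicating whether a name was recognized."""
    found = False
    for name, token in speaker_map.items():
        if name in sentence:       # simplified stand-in for word segmentation
            sentence = sentence.replace(name, token)
            found = True
    return sentence, found
```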
In step 104, the speaker-dependent sentence of the target dialogue sentence is determined among the adjacent sentences based on the results of speaker name recognition.
The speaker-dependent sentence is the sentence that reveals the speaker name of a dialogue sentence; it is generally located immediately before or after the dialogue sentence. Fig. 3 gives a specific example in which one of the adjacent sentences of the target dialogue sentence is determined as its speaker-dependent sentence.
Several possible recognition results and corresponding speaker-dependent sentence determination are given below:
(1) If only one adjacent sentence of the target dialogue sentence contains a speaker name, and that adjacent sentence has not been determined as the speaker-dependent sentence of a dialogue sentence other than the target dialogue sentence, the adjacent sentence containing the speaker name is determined as the speaker-dependent sentence of the target dialogue sentence.
The case where only one adjacent sentence of the target dialogue sentence contains a speaker name covers two situations: the target dialogue sentence has only one adjacent sentence in the target text, and that adjacent sentence contains a speaker name; or the target dialogue sentence has two adjacent sentences, but only one of them contains a speaker name. The first situation can occur when the target dialogue sentence is the first or last sentence of the target text; the second can occur when the target dialogue sentence is neither the first nor the last sentence of the target text.
In implementation, when the target dialogue sentence has only one adjacent sentence, if that adjacent sentence contains a speaker name and has not been determined as the speaker-dependent sentence of another dialogue sentence, it is determined as the speaker-dependent sentence of the target dialogue sentence.
When the target dialogue sentence has two adjacent sentences, if only one of the first and second adjacent sentences contains a speaker name and that adjacent sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, it is determined as the speaker-dependent sentence of the target dialogue sentence. The first adjacent sentence is the adjacent sentence preceding the target dialogue sentence, and the second adjacent sentence is the adjacent sentence following it.
(2) If both adjacent sentences of the target dialogue sentence contain speaker names, and the adjacent sentence preceding the target dialogue sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, the preceding adjacent sentence is determined as the speaker-dependent sentence of the target dialogue sentence.
The adjacent sentence preceding the target dialogue sentence is hereinafter simply referred to as the first adjacent sentence.
In practice, the target dialogue sentence and an adjacent sentence carrying a speaker name may appear in either order: the adjacent sentence with the speaker name may precede or follow the target dialogue sentence. The two orders do not differ in meaning, but they may lead to different results when determining the speaker-dependent sentence of the target dialogue sentence. In the order more customary in writing, the adjacent sentence with the speaker name precedes the dialogue sentence. When the adjacent sentence with the speaker name precedes the target dialogue sentence, the text may read: [Xiaoming sits on a chair. Xiaoming says: "Is today a workday?" Xiaogang nods yes.] When the adjacent sentence with the speaker name follows the target dialogue sentence, the text may read: [Xiaoming sits on a chair. "Is today a workday?" says Xiaoming. Xiaogang nods yes.] In both cases, when "Is today a workday?" is the target dialogue sentence, the speaker-dependent sentence is the adjacent sentence containing "Xiaoming". Since the first case is the order more consistent with writing habits, if both adjacent sentences of the target dialogue sentence contain speaker names and the first adjacent sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, the first adjacent sentence is determined as the speaker-dependent sentence of the target dialogue sentence. In this way, the probability that the target dialogue sentence and its speaker-dependent sentence are correctly matched can be increased.
(3) If both adjacent sentences of the target dialogue sentence contain speaker names, and the adjacent sentence preceding the target dialogue sentence has been determined as the speaker-dependent sentence of another dialogue sentence, the adjacent sentence following the target dialogue sentence is determined as the speaker-dependent sentence of the target dialogue sentence.
The adjacent sentence following the target dialogue sentence is hereinafter simply referred to as the second adjacent sentence.
In practice, it is more customary for the sentence containing the speaker name to precede the dialogue sentence, but in the target text the sentence containing the speaker name may also follow the dialogue sentence. For example: ["The weather today is really nice!" says Xiaoming. "Yes, let's go out and play later!" says Xiaohong with a laugh.] In this case, when "The weather today is really nice!" is the target dialogue sentence, only one of its adjacent sentences contains a speaker name, so "says Xiaoming" is determined as its speaker-dependent sentence. Then, when "Yes, let's go out and play later!" is the target dialogue sentence, both its first and second adjacent sentences contain speaker names; but since the first adjacent sentence has already been determined as the speaker-dependent sentence of "The weather today is really nice!", the second adjacent sentence is determined as the speaker-dependent sentence of the target dialogue sentence. In this way, the same adjacent sentence is prevented from being identified as the speaker-dependent sentence of multiple dialogue sentences, and the probability that the target dialogue sentence and its speaker-dependent sentence are correctly matched is improved.
(4) If the target dialogue sentence has no adjacent sentence containing a speaker name, or its adjacent sentence containing a speaker name has already been determined as the speaker-dependent sentence of another dialogue sentence, it is determined that the target dialogue sentence has no speaker-dependent sentence.
In practice, there may be dialogue with no specific speaker. In this case it must be determined that the target dialogue sentence has no speaker-dependent sentence. A dialogue sentence without a specific speaker may appear as: [After the performance ends, enthusiastic applause breaks out in the auditorium: "What a great performance!". "Thank you, everyone!" Xiaoming says with a bow.] When "What a great performance!" is the target dialogue sentence, it has no adjacent sentence containing a speaker name, and in this case it is determined that the target dialogue sentence has no speaker-dependent sentence.
There may also be cases where the adjacent sentence of the target dialogue sentence containing a speaker name has already been determined as the speaker-dependent sentence of another dialogue sentence. In this case, too, it must be determined that the target dialogue sentence has no speaker-dependent sentence. Such a case may appear as: [After the performance ends, enthusiastic applause breaks out in the auditorium. "Thank you, everyone!" Xiaoming says with a bow. "Bravo! Bravo! Bravo!" comes a chorus of cheers from below the stage.] Here, when "Thank you, everyone!" is the target dialogue sentence, "Xiaoming says with a bow" is determined as its speaker-dependent sentence. Then, when "Bravo! Bravo! Bravo!" is the target dialogue sentence, its adjacent sentence containing a speaker name has already been determined as the speaker-dependent sentence of another dialogue sentence, so it is determined that the target dialogue sentence has no speaker-dependent sentence.
In this way, when the target dialogue sentence has no adjacent sentence containing a speaker name, or its adjacent sentence containing a speaker name has already been determined as the speaker-dependent sentence of another dialogue sentence, it can be determined that the target dialogue sentence has no speaker-dependent sentence, and further that the dialogue sentence has no corresponding speaker name, instead of an incorrect speaker name being determined for it.
In the process of determining the speaker-dependent sentence of the target dialogue sentence, if no speaker-dependent sentence is determined, it is determined that the target dialogue sentence has no corresponding speaker name. If the speaker-dependent sentence of the target dialogue sentence is determined, step 105 is performed.
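Rules (1) to (4) amount to a greedy front-to-back assignment over the dialogue sentences. The following Python sketch assumes that "adjacent" means the sentences immediately before and after each dialogue sentence; the function and parameter names are chosen for this example.

```python
def assign_speaker_sentences(
    dialogue_ids: list[int],        # dialogue sentence indices, front to back
    has_name: dict[int, bool],      # sentence index -> contains a speaker name
    n_sentences: int,
) -> dict[int, int | None]:
    """Map each dialogue sentence to its speaker-dependent sentence index,
    or to None when rule (4) applies."""
    used: set[int] = set()          # adjacent sentences already assigned
    assignment: dict[int, int | None] = {}
    for q in dialogue_ids:
        prev_i = q - 1 if q > 0 else None
        next_i = q + 1 if q + 1 < n_sentences else None
        # Keep only neighbours that contain a name and are not yet used;
        # listing the preceding neighbour first realizes rule (2), and
        # rules (1) and (3) fall out of the filtering.
        candidates = [i for i in (prev_i, next_i)
                      if i is not None and has_name.get(i, False) and i not in used]
        if candidates:
            assignment[q] = candidates[0]
            used.add(candidates[0])
        else:
            assignment[q] = None    # rule (4): no speaker-dependent sentence
    return assignment
```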
In step 105, the target dialogue sentence and the speaker-dependent sentence are input into the trained speaker recognition model to obtain the speaker information of the target dialogue sentence.
The speaker recognition model may be a trained machine learning model, and the specific algorithm it adopts can be chosen according to actual requirements. The speaker recognition model may be a Machine Reading Comprehension (MRC) model; an MRC model is a question answering model that operates on text. The speaker information may be the speaker name of the target dialogue sentence, or indication information indicating that the target dialogue sentence has no corresponding speaker name.
In implementation, the target dialogue sentence may be denoted Q (Quote) and the speaker-dependent sentence S (Speaker), so that together they form a Q-S pair or an S-Q pair. After a Q-S or S-Q pair is determined, it may be input into the trained speaker recognition model.
When the MRC model is used, the target dialogue sentence and a preset question are composed into a question field, and the target dialogue sentence and the speaker-dependent sentence are composed into a question-related text field. The question is used to ask for the speaker name of the target dialogue sentence; its content may be preset, for example, "Who said this sentence?". The question field and the question-related text field are the fields required by the MRC model. After the two fields are determined, an identifier may be added in front of the question field to indicate to the machine learning model that a question is being asked, for example the identifier <CLS>. A separator may then be added between the question field and the question-related text field and after the question-related text field to distinguish the two fields, for example the separator <SEP>. Adding the identifier and separators to the question field and the question-related text field yields the input data, as shown in fig. 4. Finally, the input data is input into the trained speaker recognition model to obtain the speaker information of the target dialogue sentence.
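A sketch of how the input data may be composed for the MRC model; the <CLS>/<SEP> placement follows the description above, while the exact concatenation and spacing are assumptions.

```python
def build_mrc_input(dialogue: str, speaker_sentence: str,
                    question: str = "Who said this sentence?") -> str:
    """Compose: identifier, question field (preset question + dialogue),
    separator, question-related text field (speaker-dependent sentence +
    dialogue), trailing separator."""
    question_field = f"{question} {dialogue}"
    context_field = f"{speaker_sentence} {dialogue}"
    return f"<CLS> {question_field} <SEP> {context_field} <SEP>"
```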
After the speaker recognition model performs the corresponding prediction processing, it outputs the speaker information of the target dialogue sentence. After step 105 is completed, the target dialogue sentence may be labeled based on the speaker information.
Labeling method one
Labeling information is determined based on the speaker information and recorded on the corresponding target dialogue sentence. When the target dialogue sentence has a speaker, the content of the labeling information may be the speaker name of the target dialogue sentence or the special character corresponding to the speaker name. When the target dialogue sentence has no speaker, the content of the labeling information may be indication information indicating that the target dialogue sentence has no corresponding speaker name, for example, 00000000.
Labeling method two
A correspondence table between dialogue sentence numbers and speaker names is pre-established, for example as shown in Table 2. The correspondence table includes all speaker names in the speaker list of the target text and may further include an entry for "no speaker name". After the speaker information of the target dialogue sentence is determined, the sequence number of the target dialogue sentence among all dialogue sentences of the target text, such as 1, 2, 3, …, is obtained. If the speaker information is a speaker name, the sequence number of the target dialogue sentence is recorded in association with that speaker name in the correspondence table. If the speaker information is indication information indicating that the target dialogue sentence has no corresponding speaker name, the sequence number of the target dialogue sentence is recorded in association with the "no speaker name" entry.
TABLE 2
Dialogue sentence number        Speaker name
1, 21, ……                       No speaker name
2, 4, 6, 15, 17, 19, ……         Xiaohong
3, 5, 7, 8, 24, 26, ……          Xiaogang
……                              ……
After the speaker names of all dialogue sentences in the target text are determined, voiced reading audio may be generated based on whether each sentence in the target text is a dialogue sentence, whether each dialogue sentence has a corresponding speaker name, which speaker name corresponds to each dialogue sentence, and so on. For sentences in the target text that are not dialogue sentences, audio may be synthesized using a first timbre. For dialogue sentences that have no corresponding speaker name, audio may be synthesized using a second timbre, indicating that the sentence is dialogue without a specific speaker. When a dialogue sentence has a corresponding speaker name, audio may be synthesized using a different timbre for each speaker name, so as to distinguish the dialogue of each specific speaker in the target text. For example, audio is synthesized with a third timbre for dialogue sentences whose speaker is Xiaohong, with a fourth timbre for dialogue sentences whose speaker is Xiaoming, and so on. In this way, a multi-speaker dubbed narration mode can be realized when the audiobook is read aloud, further improving the user experience.
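A sketch of the timbre selection logic for audio synthesis; the timbre identifiers and the narrator/unnamed-speaker defaults are illustrative assumptions, not values given by the embodiment.

```python
def assign_timbre(sentence_idx: int,
                  dialogue_ids: set[int],
                  speaker_of: dict[int, str | None],
                  speaker_timbres: dict[str, str]) -> str:
    """Pick a synthesis timbre for one sentence of the target text."""
    if sentence_idx not in dialogue_ids:
        return "timbre_1"                      # first timbre: narration
    speaker = speaker_of.get(sentence_idx)
    if speaker is None:
        return "timbre_2"                      # second timbre: no named speaker
    # One distinct timbre per named speaker (third, fourth, ... timbres).
    if speaker not in speaker_timbres:
        speaker_timbres[speaker] = f"timbre_{len(speaker_timbres) + 3}"
    return speaker_timbres[speaker]
```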
An embodiment of the present application provides a method for training a speaker recognition model. Fig. 5 is a flowchart of the method for training a speaker recognition model according to an embodiment of the present application. Referring to fig. 5, the embodiment includes:
501. Acquire a sample dialogue sentence in the sample text.
The sample text is text used for training the model. The sample dialogue sentence may be any dialogue sentence in the sample text.
Before step 501, the initial text may be preprocessed and split into sentences to obtain the sample text, after which the dialogue sentences in the sample text are determined. The processes of preprocessing and sentence splitting of the initial text and of determining the dialogue sentences in the sample text are the same as in step 101 and are not repeated here.
In implementation, after all dialogue sentences in the sample text are obtained, the speaker name recognition processing of the subsequent steps can be performed on each sample dialogue sentence in turn, in front-to-back order of their positions in the sample text.
502. Perform speaker name recognition on adjacent sentences of the sample dialogue sentence in the sample text based on the sample speaker list corresponding to the sample text.
The sample speaker list comprises at least one speaker name, namely the names of all speakers in the sample text. The sample speaker list may be a list of the names of all characters in a sample novel text, or a list of the names of all roles in a sample script text. Speaker name recognition refers to finding a speaker name in a sentence.
In implementation, the sample speaker list may be pre-stored in the computer device. The computer device may first determine the position of the sample dialogue sentence; accordingly, among the adjacent sentences of the sample dialogue sentence, the adjacent sentence preceding it may be determined as the first adjacent sentence, and the adjacent sentence following it as the second adjacent sentence. Word segmentation is then performed on the content of the first and second adjacent sentences. The computer device then compares the words of the first adjacent sentence with the speaker names in the sample speaker list to determine whether the first adjacent sentence contains a speaker name, and likewise compares the words of the second adjacent sentence with the speaker names to determine whether the second adjacent sentence contains a speaker name.
503. And determining a sample speaker-dependent sentence corresponding to the sample pair sentence in the adjacent sentences of the sample pair sentences based on the speaker name recognition results of the adjacent sentences.
Several possible recognition results and corresponding speaker-dependent sentence determination are given below:
(1) And if only one adjacent sentence in the adjacent sentences of the target sentence pair contains the speaker name and the adjacent sentence is not determined as the speaker-related sentence corresponding to other sentence pairs, determining the adjacent sentence as the speaker-related sentence of the target sentence pair.
(2) If two adjacent sentences of the target pair of sentences each contain a speaker name and adjacent sentences in front of the target pair of sentences are not determined to be speaker-dependent sentences corresponding to other pairs of sentences, the front adjacent sentences are determined to be speaker-dependent sentences of the target pair of sentences.
(3) If the adjacent sentences of the target sentence pair all contain the speaker names and the adjacent sentences in front of the target sentence pair have been determined to be speaker-dependent sentences corresponding to other sentence pairs, the adjacent sentences in the rear of the target sentence pair are determined to be speaker-dependent sentences of the target sentence pair.
(4) If the target sentence pair does not have a neighboring sentence containing the speaker name, or if the neighboring sentence of the target sentence pair containing the speaker name has been determined to be a speaker-dependent sentence corresponding to another sentence pair, it is determined that the target sentence pair does not have a speaker-dependent sentence.
These several possible recognition results are the same as those in step 104 and are not repeated here.
504. Train and parameter-tune the speaker recognition model to be trained, taking the sample dialogue sentence and its corresponding sample speaker-dependent sentence as input sample data, to obtain the parameter-tuned speaker recognition model.
In implementation, a technician may first determine the sample speaker-dependent sentence and the sample dialogue sentence as input sample data and, based on their contents, label the input sample data with the correct speaker name, or with indication information in the case where there is no corresponding speaker name, as the reference speaker information. The input sample data is then fed into the speaker recognition model to be trained to obtain predicted speaker information. The predicted speaker information and the reference speaker information may then be input into a predetermined loss function to calculate a corresponding loss value. Based on the loss value, an adjustment value for each parameter to be tuned in the speaker recognition model to be trained is calculated, and the corresponding parameters are adjusted accordingly to obtain the parameter-tuned speaker recognition model.
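By way of example, one such training and tuning round could be sketched in PyTorch as below. The model architecture, the use of cross-entropy over speaker names plus a "no speaker" class, and every identifier are assumptions; the embodiment does not fix a concrete architecture or loss function:

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               inputs: torch.Tensor, reference: torch.Tensor) -> float:
    """One round of training and parameter adjustment as in step 504."""
    loss_fn = nn.CrossEntropyLoss()       # assumed choice of loss function
    optimizer.zero_grad()
    predicted = model(inputs)             # predicted speaker information
    loss = loss_fn(predicted, reference)  # loss value against the reference
    loss.backward()                       # adjustment value per parameter
    optimizer.step()                      # adjust the parameters
    return loss.item()
```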
505. If the training and parameter tuning satisfy a preset end condition, determine the parameter-tuned speaker recognition model as the trained speaker recognition model.
In implementation, a technician may train and tune the speaker recognition model to be trained based on the calculated loss values so that its speaker predictions become more accurate, thereby obtaining a speaker recognition model with accurate output. Training is performed multiple times with different sample dialogue sentences and their corresponding sample speaker-dependent sentences, and stops once a preset end condition is reached. The preset end condition may take various forms; several conditions that satisfy it are given below:
(1) The number of training iterations of the speaker recognition model to be trained, using different sample dialogue sentences and their corresponding sample speaker-dependent sentences, reaches a training count threshold.
The training count threshold may be any reasonable value, for example, 5000 or 10000, which is not limited in the embodiments of the present application. In practice, a technician may set the training count threshold in advance; when the number of training iterations reaches it, training stops and the speaker recognition model obtained after the last iteration is determined as the trained speaker recognition model.
(2) The loss values in a consecutive preset number of training iterations are all smaller than a preset loss value threshold.
The preset number and the preset loss value threshold may be any reasonable values; for example, the preset number may be 50 or 100, which is not limited in the embodiments of the present application. In practice, as the speaker recognition model under training is tuned iteration after iteration, the loss value between the output predicted speaker information and the reference speaker information gradually decreases in overall trend. If, after a certain amount of training (for example, 10000 iterations), the loss values in a consecutive preset number of iterations are all smaller than the preset loss value threshold, the current model is considered the trained speaker recognition model.
(3) The model recognition accuracy exceeds a preset accuracy threshold in a consecutive preset number of training iterations.
The preset number and the preset accuracy threshold may be any reasonable values; for example, the preset number may be 50 or 100, and the preset accuracy threshold may be 90%, which is not limited in the embodiments of the present application.
In implementation, as the speaker recognition model under training is tuned iteration after iteration, its recognition accuracy gradually increases in overall trend. If, after a certain amount of training (for example, 1000 iterations), the recognition accuracy exceeds the preset accuracy threshold in a consecutive preset number of iterations, the current model is considered the trained speaker recognition model.
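The three end conditions above might be checked together as in the following sketch. The step count, window size, and accuracy threshold reuse the example values from the text; the loss value threshold of 0.01 is not exemplified in the text and is purely an assumption, as is every name below:

```python
def should_stop(step: int, losses: list, accuracies: list,
                max_steps: int = 10000, window: int = 50,
                loss_threshold: float = 0.01, acc_threshold: float = 0.90) -> bool:
    """Return True once any of the end conditions (1)-(3) is satisfied."""
    if step >= max_steps:                                      # condition (1)
        return True
    recent = losses[-window:]
    if len(recent) == window and max(recent) < loss_threshold:
        return True                                            # condition (2)
    recent = accuracies[-window:]
    if len(recent) == window and min(recent) > acc_threshold:
        return True                                            # condition (3)
    return False
```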
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
The technical solution provided by the embodiments of the present application has the following beneficial effects: the dialogue sentence and its corresponding speaker-dependent sentence are obtained from the target text and input into the speaker recognition model to obtain the corresponding speaker information. In this way, speaker recognition is performed in sentence units, and the samples required to train the model are likewise in sentence units. An article usually contains many dialogue sentences and speaker-dependent sentences, so only a small number of articles are needed to obtain a large number of dialogue sentences and speaker-dependent sentences. Therefore, even when the number of sample articles is small, the method can improve the recognition accuracy of the speaker recognition model.
The embodiment of the present application provides an apparatus for recognizing a speaker for text. The apparatus may be the computer device in the foregoing embodiments. As shown in fig. 6, the apparatus includes:
an obtaining module 610, configured to obtain a target dialogue sentence in a target text;
a recognition module 620, configured to perform speaker name recognition on adjacent sentences of the target dialogue sentence based on a speaker list corresponding to the target text, where the speaker list comprises at least one speaker name;
a determining module 630, configured to determine, among the adjacent sentences, the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences;
an output module 640, configured to, if the speaker-dependent sentence of the target dialogue sentence is determined, input the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain speaker information of the target dialogue sentence, where the speaker information is the speaker name of the target dialogue sentence or indication information indicating that the target dialogue sentence has no corresponding speaker name.
In a possible implementation manner, the obtaining module 610 is configured to:
dividing the target text into sentences based on end-of-sentence punctuation to obtain a plurality of sentences;
for sentences containing quotation marks, dividing the content inside the quotation marks and the content outside the quotation marks into different sentences;
and among all the sentences obtained by dividing the target text, determining the content inside the quotation marks as the dialogue sentences of the target text.
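A minimal Python sketch of this division is given below, assuming Chinese-style double quotes delimit dialogue and that 。！？ end sentences; both conventions, and the regular expressions, are illustrative assumptions:

```python
import re

def extract_dialogue_sentences(text: str) -> list:
    """Split quoted spans from the surrounding narration and return the
    contents inside the quotation marks as dialogue sentences."""
    parts = re.split(r'(“[^”]*”)', text)    # isolate each quoted span
    dialogues = []
    for part in parts:
        if part.startswith('“') and part.endswith('”'):
            dialogues.append(part[1:-1])     # content inside the quotes
        # narration outside the quotes would be further split on 。！？
        # with e.g. re.split(r'(?<=[。！？])', part)
    return dialogues

print(extract_dialogue_sentences('张三说：“我们走吧。”李四点了点头。'))
# -> ['我们走吧。']
```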
In a possible implementation manner, the obtaining module 610 is further configured to:
acquiring the target dialogue sentence from among all the dialogue sentences of the target text, in order from front to back according to their positions in the target text.
In one possible implementation manner, the identifying module 620 is configured to:
searching the adjacent sentences of the target dialogue sentence for the speaker names in the speaker list corresponding to the target text, and determining which adjacent sentences of the target dialogue sentence contain a speaker name.
In a possible implementation manner, the determining module 630 is configured to:
determining, if only one of the adjacent sentences of the target dialogue sentence contains a speaker name and the adjacent sentence containing the speaker name has not been determined as the speaker-dependent sentence of a dialogue sentence other than the target dialogue sentence, the adjacent sentence containing the speaker name as the speaker-dependent sentence of the target dialogue sentence;
determining, if both adjacent sentences of the target dialogue sentence contain speaker names and the adjacent sentence before the target dialogue sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, the preceding adjacent sentence as the speaker-dependent sentence of the target dialogue sentence;
determining, if both adjacent sentences of the target dialogue sentence contain speaker names and the adjacent sentence before the target dialogue sentence has been determined as the speaker-dependent sentence of another dialogue sentence, the following adjacent sentence as the speaker-dependent sentence of the target dialogue sentence;
and determining that the target dialogue sentence has no speaker-dependent sentence if no adjacent sentence of the target dialogue sentence contains a speaker name, or if every adjacent sentence of the target dialogue sentence containing a speaker name has been determined as the speaker-dependent sentence of another dialogue sentence.
In a possible implementation manner, the determining module 630 is further configured to:
if the speaker-dependent sentence of the target dialogue sentence is not determined, determining that the target dialogue sentence has no corresponding speaker name.
In one possible implementation, the speaker recognition model is a machine reading comprehension (MRC) model;
the output module 640 is configured to: form a question field from the target dialogue sentence and a preset question, and form a question-related text field from the target dialogue sentence and the speaker-dependent sentence, where the question is used to ask for the speaker name of the target dialogue sentence;
and input the question field and the question-related text field into the MRC model to obtain the speaker information of the target dialogue sentence.
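As a sketch of how the two fields might be assembled, the question template below is an illustrative assumption, not a wording fixed by the embodiment; the closing comment indicates where an extractive reading-comprehension model would consume the pair:

```python
def build_mrc_input(dialogue: str, speaker_sentence: str):
    """Assemble the question field and the question-related text field."""
    # Question field: the target dialogue sentence plus a preset question
    # asking for its speaker (the wording is an assumed template).
    question = dialogue + " Who is the speaker of this dialogue?"
    # Question-related text field: the target dialogue sentence together
    # with its speaker-dependent sentence.
    context = speaker_sentence + " " + dialogue
    return question, context

question, context = build_mrc_input(
    '"Let\'s go."', "Zhang San waved his hand.")
# The (question, context) pair would then be fed to an extractive MRC
# model, which returns a speaker name span or an empty answer
# (i.e., no corresponding speaker name).
```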
The embodiment of the present application provides an apparatus for training a speaker recognition model. The apparatus may be the computer device in the above embodiments. As shown in fig. 7, the apparatus includes:
an obtaining module 710, configured to obtain a sample dialogue sentence in a sample text;
a recognition module 720, configured to perform speaker name recognition on adjacent sentences of the sample dialogue sentence based on a sample speaker list corresponding to the sample text, where the sample speaker list comprises at least one speaker name;
a determining module 730, configured to determine, among the adjacent sentences of the sample dialogue sentence, the sample speaker-dependent sentence corresponding to the sample dialogue sentence based on the speaker name recognition results of the adjacent sentences;
a training module 740, configured to train and parameter-tune a speaker recognition model to be trained, taking the sample dialogue sentence and its corresponding sample speaker-dependent sentence as input sample data, to obtain a parameter-tuned speaker recognition model;
and an ending module 750, configured to determine the parameter-tuned speaker recognition model as the trained speaker recognition model if the training and parameter tuning satisfy a preset end condition.
It should be noted that, in the apparatus for recognizing a speaker for text provided by the above embodiment, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for recognizing a speaker for text and the method of recognizing a speaker for text provided by the above embodiments belong to the same concept; their specific implementation processes are detailed in the method embodiments and are not repeated here.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server 800 may vary greatly with configuration or performance, and may include one or more processors (CPUs) 801 and one or more memories 802, where the memory 802 stores at least one instruction that is loaded and executed by the processor 801 to implement the methods provided by the foregoing method embodiments. Of course, the server may further include a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, as well as other components for implementing the functions of the device, which are not described here again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, comprising instructions executable by a processor in a terminal to perform the method of recognizing a speaker for text and the method of training a speaker recognition model of the above embodiments. The computer-readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.), and signals (including but not limited to signals transmitted between a user terminal and other equipment, etc.) referred to in the present application are authorized by a user or are sufficiently authorized by various parties, and the collection, use, and processing of the relevant data need to comply with relevant laws and regulations and standards in relevant countries and regions.
The above description is intended only to illustrate the alternative embodiments of the present application, and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A method of recognizing a speaker for text, the method comprising:
acquiring a target dialogue sentence in a target text, wherein the target dialogue sentence is any dialogue sentence of the target text;
performing speaker name recognition on adjacent sentences of the target dialogue sentence based on a speaker list corresponding to the target text, wherein the speaker list comprises at least one speaker name;
determining, among the adjacent sentences, a speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences;
and if the speaker-dependent sentence of the target dialogue sentence is determined, inputting the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain speaker information of the target dialogue sentence, wherein the speaker information is the speaker name of the target dialogue sentence or indication information indicating that the target dialogue sentence has no corresponding speaker name.
2. The method of claim 1, wherein before the acquiring a target dialogue sentence in a target text, the method further comprises:
dividing the target text into sentences based on end-of-sentence punctuation to obtain a plurality of sentences;
for sentences containing quotation marks, dividing the content inside the quotation marks and the content outside the quotation marks into different sentences;
and among all the sentences obtained by dividing the target text, determining the content inside the quotation marks as the dialogue sentences of the target text.
3. The method of claim 1, wherein the acquiring a target dialogue sentence in a target text comprises:
acquiring the target dialogue sentence from among all the dialogue sentences of the target text, in order from front to back according to their positions in the target text.
4. The method of claim 1, wherein the performing speaker name recognition on adjacent sentences of the target dialogue sentence based on the speaker list corresponding to the target text comprises:
searching the adjacent sentences of the target dialogue sentence for the speaker names in the speaker list corresponding to the target text, and determining which adjacent sentences of the target dialogue sentence contain a speaker name.
5. The method of claim 4, wherein the determining, among the adjacent sentences, the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences comprises:
if only one of the adjacent sentences of the target dialogue sentence contains a speaker name and the adjacent sentence containing the speaker name has not been determined as the speaker-dependent sentence of a dialogue sentence other than the target dialogue sentence, determining the adjacent sentence containing the speaker name as the speaker-dependent sentence of the target dialogue sentence;
if both adjacent sentences of the target dialogue sentence contain speaker names and the adjacent sentence before the target dialogue sentence has not been determined as the speaker-dependent sentence of another dialogue sentence, determining the preceding adjacent sentence as the speaker-dependent sentence of the target dialogue sentence;
if both adjacent sentences of the target dialogue sentence contain speaker names and the adjacent sentence before the target dialogue sentence has been determined as the speaker-dependent sentence of another dialogue sentence, determining the following adjacent sentence as the speaker-dependent sentence of the target dialogue sentence;
and if no adjacent sentence of the target dialogue sentence contains a speaker name, or if every adjacent sentence of the target dialogue sentence containing a speaker name has been determined as the speaker-dependent sentence of another dialogue sentence, determining that the target dialogue sentence has no speaker-dependent sentence.
6. The method of any one of claims 1 to 5, wherein after the determining, among the adjacent sentences, the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences, the method further comprises:
if the speaker-dependent sentence of the target dialogue sentence is not determined, determining that the target dialogue sentence has no corresponding speaker name.
7. The method of any one of claims 1 to 5, wherein the speaker recognition model is a machine reading comprehension (MRC) model;
the inputting the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain speaker information of the target dialogue sentence comprises:
forming a question field from the target dialogue sentence and a preset question, and forming a question-related text field from the target dialogue sentence and the speaker-dependent sentence, wherein the question is used to ask for the speaker name of the target dialogue sentence;
and inputting the question field and the question-related text field into the MRC model to obtain the speaker information of the target dialogue sentence.
8. A method of training a speaker recognition model, the method comprising:
acquiring a sample dialogue sentence in a sample text;
performing speaker name recognition on adjacent sentences of the sample dialogue sentence based on a sample speaker list corresponding to the sample text, wherein the sample speaker list comprises at least one speaker name;
determining, among the adjacent sentences of the sample dialogue sentence, a sample speaker-dependent sentence corresponding to the sample dialogue sentence based on the speaker name recognition results of the adjacent sentences;
training and parameter-tuning a speaker recognition model to be trained, taking the sample dialogue sentence and its corresponding sample speaker-dependent sentence as input sample data, to obtain a parameter-tuned speaker recognition model;
and if the training and parameter tuning satisfy a preset end condition, determining the parameter-tuned speaker recognition model as the trained speaker recognition model.
9. An apparatus for recognizing a speaker for text, the apparatus comprising:
the acquisition module is used for acquiring a target dialogue sentence in a target text;
the recognition module is used for performing speaker name recognition on adjacent sentences of the target dialogue sentence based on a speaker list corresponding to the target text, wherein the speaker list comprises at least one speaker name;
the determining module is used for determining, among the adjacent sentences, the speaker-dependent sentence of the target dialogue sentence based on the speaker name recognition results of the adjacent sentences;
and the output module is used for inputting, if the speaker-dependent sentence of the target dialogue sentence is determined, the target dialogue sentence and the speaker-dependent sentence into a trained speaker recognition model to obtain speaker information of the target dialogue sentence, wherein the speaker information is the speaker name of the target dialogue sentence or indication information indicating that the target dialogue sentence has no corresponding speaker name.
10. An apparatus for training a speaker recognition model, the apparatus comprising:
the acquisition module is used for acquiring a sample dialogue sentence in a sample text;
the identification module is used for performing speaker name recognition on adjacent sentences of the sample dialogue sentence based on a sample speaker list corresponding to the sample text, wherein the sample speaker list comprises at least one speaker name;
the determining module is used for determining, among the adjacent sentences of the sample dialogue sentence, the sample speaker-dependent sentence corresponding to the sample dialogue sentence based on the speaker name recognition results of the adjacent sentences;
the training module is used for training and parameter-tuning a speaker recognition model to be trained, taking the sample dialogue sentence and its corresponding sample speaker-dependent sentence as input sample data, to obtain a parameter-tuned speaker recognition model;
and the ending module is used for determining the parameter-tuned speaker recognition model as the trained speaker recognition model if the training and parameter tuning satisfy a preset end condition.
11. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to perform operations performed by the method of any one of claims 1 to 8.
12. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to perform operations performed by a method according to any one of claims 1 to 8.
13. A computer program product comprising at least one instruction which is loaded and executed by a processor to perform the operations performed by the method of any one of claims 1 to 8.